[Cod-bugs] 2997 invalid files in C.O.D.
Saulius Gražulis
grazulis at ibt.lt
Sat Jul 8 17:35:44 EEST 2023
Dear David,
thank you for your e-mail and the list of issues that you have provided.
The feedback from the COD users, and of course that includes your
feedback, is very valuable for us. We do our best to correct the COD
entries if there are errors in them and to make COD as accurate as
possible. In doing so we strictly stick to the definitions of the CIF
provided by the IUCr and the best current practices we are aware of in
crystallography. Sometimes, however, it is not possible to make all
corrections that our users request. Below, I'll give my comments on the
issues you raise.
On 2023-07-05 12:59, David Palmer wrote:
> Dear Colleagues,
>
> I send you a message a few weeks ago about my plans to provide easy
> phase ID via C.O.D.-hosted structures. I haven’t heard back from you,
> so I assume you have no objections.
I must admit that we have not received your previous mail; it is
possible that the e-mail was lost on the way since we had some mail
server failures in our university. In any case, from your current letter
I understand that you would like to provide material identification
software based on the COD and make it public. If this is so, then we
have absolutely no objections to that; in fact the COD exists to make
such projects possible! Of course please advise your users that they
cite the original publications that produced data records in the COD if
specific records are used, as is customary in scientific practice, and
we would appreciate citation and reference of the COD as well, where
relevant.
As a side note, we never abbreviate our database as the 'C.O.D.' (with
periods); it is usually written as an initialism 'COD'.
>
> In the meantime, we have used our automated tools to analyse all
> current structures files. I am attaching a summary, listing file IDs
> and errors for 2,997 out of your 0.5M or so files: a relatively-small
> figure (ca. 0.6%). However, these files are invalid, and cannot be
> used for structural work, so I would recommend getting them fixed.
>
Thanks for providing the list of the files that failed processing; we
will have a close look at them.
As a note, the term "valid" in the CIF framework has a quite specific
meaning – it means that the structure CIFs are valid according to some
declared CIF dictionaries. The invalid files may or may not be suitable
for structural work, and may or may not be amenable for corrections.
Currently, three levels of checks are performed in the COD, with the
following guarantees we provide:
- a syntax check. We guarantee that the CIFs from the COD are conformant
to the syntax declared by the IUCr, using our CIF parser [1] and other
parsers in the field. This ensures that the COD files can be processed
in an automated way. Thus, if you spot a syntactically wrong file,
please report it and we will fix that immediately; the file has to be
checked against the IUCr CIF grammar.
- a dictionary check. The files that validate against the IUCr
dictionaries are using the data elements in an intended way. Though many
files in the COD are indeed valid in this sense, a substantial portion
of them raise one or several validation issues (we compiled over 11
million validation messages from the current COD collection). We look into
them and search for systematic ways to correct the most serious ones,
but this is ongoing work and full validity cannot be
practically achieved at the moment;
- we do certain COD specific checks (e.g. checking that all three
coordinate data items, _atom_site_fract_{x,y,z} are present). This is
supposed to catch the most obvious mistakes in the data files, but can only
be used for improving the COD records if we get hold of the correct original
data.
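As an illustration of the third kind of check, here is a minimal Python
sketch that reports which of the three coordinate data items are absent
from a loop header. The function name and the idea of passing the loop
tags as a plain list are assumptions made for this example; this is not
the code the COD itself runs.

```python
# Hypothetical helper, not part of the COD tools: the input is assumed to
# be the list of data names found in one CIF loop_ header.
REQUIRED_COORDINATE_TAGS = (
    "_atom_site_fract_x",
    "_atom_site_fract_y",
    "_atom_site_fract_z",
)


def missing_coordinate_tags(loop_tags):
    """Return the required coordinate data names absent from loop_tags."""
    # CIF data names are case-insensitive, so compare in lower case.
    present = {tag.lower() for tag in loop_tags}
    return [tag for tag in REQUIRED_COORDINATE_TAGS if tag not in present]
```

An empty result means that all three coordinate items are declared; it
says nothing about whether the values themselves are numeric.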
Before we go into more details about the issues you report, let me draw
your attention to one feature of the CIF framework that will be important:
the CIF files MAY (as in RFC 2119) contain special values '?' and '.'
(without the quotes) as values for any data item in the file. The files
that contain such values are both syntactically correct and valid in the
sense defined above (i.e. such values validate against CIF
dictionaries). The '?' value, as we understand it, denotes that the
actual value of the data item is not known (but may become known in the
future). The value '.' denotes that the value is not relevant, or does
not exist at all. We sometimes use these values to indicate special
situations in the COD files; they can also be used as atom coordinate
values. Any CIF compliant software should be prepared to deal with such
values.
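A reader of coordinate values could treat the two special values along
the following lines. This is only a sketch in Python under stated
assumptions: the function name and the three state strings are invented
for the example, and the input is assumed to be an unquoted value token
as delivered by a CIF parser (a quoted '?' is an ordinary string and
would have to be told apart by the parser).

```python
import re

# A number with an optional standard uncertainty in parentheses,
# e.g. 0.2500(3). The pattern is a simplification for this sketch.
_NUMBER_WITH_SU = re.compile(
    r"^([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)(?:\(\d+\))?$"
)


def parse_cif_number(raw):
    """Return (value, state) for one unquoted CIF value token.

    state is 'known', 'unknown' (the token was ?) or
    'inapplicable' (the token was .); the names are hypothetical.
    """
    if raw == "?":
        return None, "unknown"
    if raw == ".":
        return None, "inapplicable"
    match = _NUMBER_WITH_SU.match(raw)
    if match is None:
        raise ValueError(f"not a CIF number: {raw!r}")
    # The standard uncertainty is dropped; only the value is returned.
    return float(match.group(1)), "known"
```

A caller can then skip, flag, or report atoms whose coordinates are not
'known' instead of rejecting the whole file.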
> The most common errors are:
>
> - missing fractional coordinates
There are several occasions when coordinate values are missing; let me
illustrate them from the list that you have provided:
- 2217080: this entry contains '.' as atom coordinates for a serious
reason: the structure that was published in a peer-reviewed article
appeared to be fake and was retracted. To avoid erroneous calculations,
the original coordinate values were replaced by '.', marking them as
irrelevant, and the entry is marked as retracted. It is retained in the
COD database as a historic record and to prevent its renewed deposition.
The exact reasons for retraction are documented in the COD CIF file, and
the references to relevant IUCr editorials are given.
You may want to filter out retracted entries, either by checking the
'_cod_entry_issue_severity' data item or by querying status in our SQL
database:
    mysql -u cod_reader -h sql.crystallography.net cod -e 'select file
    from data where status not like "%retracted%" or status is NULL'
There are more flags that you may want to filter out (suboptimal
structures, duplicates, structures without coordinates, structures with
warnings, etc.); please check our Wiki from the COD Web page for full
documentation.
- 1000195: this entry contains '?' as coordinates, indicating that they
are unknown. Looking at the publication year (1962) I realise that this
is a very old publication; we do not have the paper at hand, and it is
also likely that the coordinates were not reported for some compounds at
that time, only cell parameters.
The COD entries of this kind are provided to indicate that the
publication existed, and to provide the information currently known
(cell parameters, chemical composition, crystal symmetry). This
information is already enough for some kinds of computations (e.g. as
initial approximations for DFT).
If we ever get the original publication and the coordinates are
published there, we will insert them in a new revision of this entry.
If you have access to the original publication, we would be grateful if
you shared it (or the updated CIF ;) with us.
- 5900030: in this entry, the x coordinate has values '.' since these
values were not determined in the original publication; while physically
the x coordinate is defined for the structure, it is not available from
this particular publication (i.e. we have no chance to recover it from
published data). Other data values, such as cell constants and the y-z
coordinates of the projection are available and can be used.
> - ambiguous site labelling
I am not quite sure what problem you mean there. One known issue is that
some structures do have duplicate atom labels. This is an error, and we
will fix it with time. This involves a fair amount of manual checking
however, so I cannot promise we will do it quickly.
For the moment, a possible workaround would be to add a unique suffix to
such atom labels during the structure interpretation and then process
the structure as usual.
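The suffix workaround could look roughly like this in Python. The
function name and the "_1", "_2" suffix format are assumptions chosen
for the example; any scheme that yields unique in-memory labels would
serve, and the labels in the CIF itself stay untouched.

```python
from collections import Counter


def make_labels_unique(labels):
    """Return a copy of labels in which duplicates get a numeric suffix.

    Labels that occur only once are returned unchanged. The suffix
    format (underscore plus a counter) is a hypothetical choice.
    """
    totals = Counter(labels)   # how often each label occurs
    used = set(labels)         # labels that must not be produced again
    seen = Counter()           # per-label suffix counter
    result = []
    for label in labels:
        if totals[label] == 1:
            result.append(label)
            continue
        # Advance the counter until the suffixed label is unused.
        while True:
            seen[label] += 1
            candidate = f"{label}_{seen[label]}"
            if candidate not in used:
                break
        used.add(candidate)
        result.append(candidate)
    return result
```

The order of the atoms is preserved, so the renamed labels can be
matched back to the rows of the coordinate loop by position.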
> - invalid element symbols
This is a known issue, especially with atoms from AMCSD that have a custom
labelling scheme.
Fortunately, the new version of AMCSD has new, consistent atom naming,
and we could assign atom types semi-automatically for these entries.
Incidentally, I have just finished the analysis and assignment of atom types
to those entries.
Please check out the COD revision 285101 – it should have most of the
atoms with the correct types assigned. As per my checks, only 45 COD
entries remain that still have unrecognised atom types (if you take
_atom_site_type_label into account, of course). Some of these are indeed
unknown atoms, such as metal sites with uncertain identity.
Please let us know how this revision scores with your software!
>
> A common issue is a mismatch between site labels in different data
> blocks (e.g., a table of anisotropic displacement parameters and a
> table of fractional coordinates).
Just a bit of nit-picking on terminology – all COD files contain just
one data block (it starts with a unique data_... header in each CIF).
ADPs and coordinates are usually located in different /loops/ in the
same data block.
> We found these errors in numerous files submitted via the *American
> Mineralogist crystal structures database* (clearly, substantial
> amounts of U.S. governmental funding failed to prevent basic
> transcription errors!)
In all fairness, I would say that Bob Downs and his team do a good job
collecting all minerals; without the AMCSD contribution, our COD collection
of minerals would have been much shabbier. They are constantly improving
their collection (I'm constantly in touch with Bob on these matters),
and their recent work enabled us to assign atom types with reasonable
effort. As for the funding, I'm not sure if they get substantial
amounts of it; I am aware of several startup grants they had, and I
think they used them as well as they could.
This does not mean that matters cannot be improved :), and we are
working on that as well. The discrepancy of the labels in the Uij and
xyz loops is a known issue that appeared in the recent update. We are
working with Bob to rectify this, but this will take some time. In
the meantime, I have a suggestion of a workaround below:
>
> Take the following file, 9003355, as an example:-
>
> • Sites SiT1’, AlT1’ (etc.) are listed in the loop containing Uij
> • The same site are labelling differently (e.g., SiT1*, AlT1*, etc.)
> in the loop containing xyz
>
> Whilst, to a human, one could make inferences as to how these labels
> should be related, a computer cannot make such a judgement, thereby
> rendering these files useless.
I agree that humans can match the labels, and potentially fix them; we
have no manpower however to go through these lists manually, and even
then the manual editing would be error-prone. We could apply a
heuristic that an apostrophe ("'") in one loop corresponds to the
asterisk ("*") in the other loop and make an automatic correction, but
the results would still need to be checked manually (I am reluctant to commit
to the COD changes that are based on broad guesses); also, there are
some other patterns in place (e.g. 'OH' vs 'O-H' change in labels).
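On the client side, the apostrophe/asterisk and 'OH'/'O-H' patterns
mentioned here could be tried as in the Python sketch below. The
function names are hypothetical, and the two normalisation rules are
exactly the broad guesses discussed above, so any match they produce
should be treated as provisional rather than as a correction.

```python
def normalise_label(label):
    """Fold the two observed label variants into one spelling (a guess)."""
    # Asterisk and apostrophe are treated as the same mark; hyphens are
    # dropped so that 'O-H' and 'OH' compare equal.
    return label.replace("*", "'").replace("-", "")


def match_aniso_labels(site_labels, aniso_labels):
    """Map each Uij-loop label to a coordinate-loop label, or to None.

    An exact match wins; otherwise the normalised spellings are compared
    and the match is accepted only when it is unambiguous.
    """
    exact = set(site_labels)
    by_normalised = {}
    for label in site_labels:
        by_normalised.setdefault(normalise_label(label), []).append(label)

    mapping = {}
    for aniso in aniso_labels:
        if aniso in exact:
            mapping[aniso] = aniso
            continue
        candidates = by_normalised.get(normalise_label(aniso), [])
        mapping[aniso] = candidates[0] if len(candidates) == 1 else None
    return mapping
```

Labels mapped to None are the ones that still need a human look or
should be skipped.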
From the error messages in the log file that you sent us, I have the
impression that your program looks for an atom label in the
_atom_site_aniso_label (aka Uij) loop, and then tries to find the
corresponding _atom_site_label in the coordinate loop. This will fail
not only when the labels do not match but also when the atom is not
mentioned in the _atom_site_aniso_label loop /at all/. Since not all
atoms are refined anisotropically, some of them can legitimately be left
out of the Uij loop while still being present in the _atom_site_fract_x loop;
such files are perfectly valid and usable.
May I suggest a workaround for the processing of such files: look
first in the coordinate loop for the _atom_site_label to identify
all atoms, and then look up the anisotropic displacement parameters Uij
in the _atom_site_aniso_label loop if they exist. If they do not, it is
often possible to use Uiso instead, and I bet this will be a fair
approximation even for anisotropically refined atoms. In this way you
will correctly process all correct files and have reasonable
approximate data for the files that are currently mislabelled. In the
future we will correct the Uij<->xyz label correspondence (our validator
detects them), and you can then recalculate your outputs with the new
COD revision, getting more accurate results. I can let you know when
such a revision is issued in the COD, but please ping me after some time
since I can forget :)
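The coordinate-first lookup can be sketched in Python as follows. The
data layout is an assumption made for the example: each row of the
coordinate loop is a dict with a 'label' key and an optional 'u_iso'
key, and the Uij loop is a dict keyed by _atom_site_aniso_label. The
function name and the 'Uani'/'Uiso'/'none' tags are hypothetical too.

```python
def collect_displacements(sites, aniso):
    """Pair every atom of the coordinate loop with displacement data.

    sites: list of dicts, one per coordinate-loop row (assumed layout).
    aniso: dict mapping an aniso label to its six Uij values.
    Returns a list of (label, kind, value) tuples.
    """
    result = []
    # The coordinate loop drives the iteration, so no atom is lost when
    # it is absent from, or mislabelled in, the Uij loop.
    for site in sites:
        label = site["label"]
        if label in aniso:
            result.append((label, "Uani", aniso[label]))
        elif site.get("u_iso") is not None:
            # Fall back to the isotropic value as an approximation.
            result.append((label, "Uiso", site["u_iso"]))
        else:
            result.append((label, "none", None))
    return result
```

Iterating over the coordinate loop rather than the Uij loop is the
whole point of the design: a label mismatch then degrades the
displacement data for one atom instead of invalidating the file.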
Of course one can also apply the heuristic mentioned above, or skip
such entries with mismatches altogether, until the new COD revision is
in place.
Hope this clarifies the COD data contents and the way we address the
detected problems.
Once more thank you for your report!
>
> I hope this helps, and do let me know if you have any questions.
>
> With best wishes,
> Yours faithfully,
>
> David Palmer
>
> David C Palmer, Ph.D. (Cantab), M.A. (Cantab),
> Managing Director, CrystalMaker Software Ltd
> Centre for Innovation & Enterprise | Oxford University Begbroke
> Science Park
> Woodstock Road, Begbroke, Oxfordshire, OX5 1PF, UK
>
Sincerely yours,
Saulius
References:
[1] Merkys, A.; Vaitkus, A.; Butkus, J.; Okulič-Kazarinas, M.; Kairys,
V. & Gražulis, S.
/COD::CIF::Parser/: an error-correcting CIF parser for the Perl language.
/Journal of Applied Crystallography/, *2016*, /49/, 292-301, DOI:
https://doi.org/10.1107/S1600576715022396
--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366