[Cod-bugs] 2997 invalid files in C.O.D.

Sat Jul 8 17:35:44 EEST 2023

Dear David,

thank you for your e-mail and the list of issues that you have provided.

The feedback from the COD users, and of course that includes your 
feedback, is very valuable for us. We do our best to correct the COD 
entries if there are errors in them and to make COD as accurate as
possible. In doing so we strictly stick to the definitions of the CIF 
provided by the IUCr and the best current practices we are aware of in 
crystallography. Sometimes, however, it is not possible to make all 
corrections that our users request. Below, I'll give my comments on the 
issues you raise.

On 2023-07-05 12:59, David Palmer wrote:
> Dear Colleagues,
>
> I send you a message a few weeks ago about my plans to provide easy 
> phase ID via C.O.D.-hosted structures. I haven’t heard back from you, 
> so I assume you have no objections.

I must admit that we have not received your previous mail; it is 
possible that the e-mail was lost on the way since we had some mail 
server failures in our university. In any case, from your current letter
I understand that you would like to provide material identification
software based on the COD and make it public. If this is so, then we
have absolutely no objections to that; in fact, the COD exists to make
such projects possible! Of course, please advise your users to cite the
original publications that produced the data records in the COD if
specific records are used, as is customary in scientific practice; we
would also appreciate citation of and reference to the COD, where
relevant.

As a side note, we never abbreviate our database as the 'C.O.D.' (with 
periods); it is usually written as an initialism 'COD'.

>
> In the meantime, we have used our automated tools to analyse all 
> current structures files. I am attaching a summary, listing file IDs 
> and errors for 2,997 out of your 0.5M or so files: a relatively-small 
> figure (ca. 0.6%). However, these files are invalid, and cannot be 
> used for structural work, so I would recommend getting them fixed.
>
Thanks for providing the list of the files that failed processing; we
will have a close look at them.

As a note, the term "valid" has a quite specific meaning in the CIF
framework: it means that the structure CIFs are valid according to some
declared CIF dictionaries. Invalid files may or may not be suitable
for structural work, and may or may not be amenable to correction.

Currently, three levels of checks are performed in the COD, with the 
following guarantees we provide:

- a syntax check. We guarantee that the CIFs from the COD are conformant 
to the syntax declared by the IUCr, using our CIF parser [1] and other 
parsers in the field. This ensures that the COD files can be processed 
in an automated way. Thus, if you spot a syntactically wrong file, 
please report it and we will fix that immediately; the file has to be 
checked against the IUCr CIF grammar.

- a dictionary check. Files that validate against the IUCr
dictionaries use the data elements in the intended way. Though many
files in the COD are indeed valid in this sense, a substantial portion
of them raise one or several validation issues (we compiled over 11
million validation messages from the current COD collection). We look
into them and search for systematic ways to correct the most serious
ones, but this is ongoing work and full validity cannot practically be
achieved at the moment.

- COD-specific checks (e.g. checking that all three coordinate data
items, _atom_site_fract_{x,y,z}, are present). These are supposed to
catch the most obvious mistakes in the data files, but can only be used
for improving the COD records if we get hold of the correct original
data.
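
To make the last kind of check concrete, here is a minimal Python
sketch of a presence test for the three coordinate data items. It is an
illustration only, not the code the COD actually runs; the function name
and the assumption that the data names of the atom-site loop are already
available as a list of strings are hypothetical.

```python
from typing import Iterable, List

# The three fractional-coordinate data names mentioned above.
REQUIRED_TAGS = (
    "_atom_site_fract_x",
    "_atom_site_fract_y",
    "_atom_site_fract_z",
)


def missing_coordinate_tags(loop_tags: Iterable[str]) -> List[str]:
    """Return the required coordinate data names absent from a loop header.

    CIF data names are case-insensitive, so the comparison is done on
    lower-cased names.
    """
    present = {tag.lower() for tag in loop_tags}
    return [tag for tag in REQUIRED_TAGS if tag not in present]


# Hypothetical loop header in which the z coordinate was left out.
header = ["_atom_site_label", "_atom_site_fract_x", "_atom_site_Fract_Y"]
print(missing_coordinate_tags(header))
```

An empty result means that all three data items are declared; it says
nothing yet about whether their values are usable.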

Before we go into more detail about the issues you report, let me draw
your attention to one feature of the CIF framework that will be important:

the CIF files MAY (as in RFC 2119) contain special values '?' and '.' 
(without the quotes) as values for any data item in the file. The files 
that contain such values are both syntactically correct and valid in the 
sense defined above (i.e. such values validate against CIF 
dictionaries). The '?' value, as we understand it, denotes that the 
actual value of the data item is not known (but may become known in the
future). The value '.' denotes that the value is not relevant, or does 
not exist at all. We sometimes use these values to indicate special 
situations in the COD files; they can also be used as atom coordinate 
values. Any CIF compliant software should be prepared to deal with such 
values.
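
As an illustration of what "prepared to deal with such values" could
mean in practice, here is a minimal Python sketch. It assumes that the
reader already has the raw, unquoted value tokens as strings; the
function names are hypothetical, and the number pattern is a simplified
one (a number with an optional standard uncertainty in parentheses),
not the full CIF grammar.

```python
import re
from typing import Optional

# A number with an optional standard uncertainty, e.g. "0.1234(5)".
_NUMBER = re.compile(
    r"^([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)(?:\(\d+\))?$"
)


def parse_cif_number(raw: str) -> Optional[float]:
    """Return the numeric value, or None for '?' (unknown) and '.' (inapplicable)."""
    raw = raw.strip()
    if raw in ("?", "."):
        return None
    match = _NUMBER.match(raw)
    if match is None:
        raise ValueError(f"not a CIF number: {raw!r}")
    # The standard uncertainty, if any, is deliberately dropped here.
    return float(match.group(1))


def has_usable_coordinates(x: str, y: str, z: str) -> bool:
    """True only if all three coordinate tokens carry actual numbers."""
    return all(parse_cif_number(v) is not None for v in (x, y, z))
```

With such a helper, an entry whose coordinates are '?' or '.' can be
skipped or flagged by the calling program instead of being reported as a
broken file.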

> The most common errors are:
>
> - missing fractional coordinates

There are several cases in which coordinate values are missing; let me
illustrate them from the list that you have provided:

- 2217080: this entry contains '.' as atom coordinates for a serious 
reason: the structure that was published in a peer-reviewed article 
appeared to be fake and was retracted. To avoid erroneous calculations, 
the original coordinate values were replaced by '.', marking them as 
irrelevant, and the entry is marked as retracted. It is retained in the 
COD database as a historic record and to prevent its renewed deposition. 
The exact reasons for retraction are documented in the COD CIF file, and 
the references to relevant IUCr editorials are given.

You may want to filter out retracted entries, either by checking the 
'_cod_entry_issue_severity' data item or by querying status in our SQL 
database:

    mysql -u cod_reader -h sql.crystallography.net cod -e 'select file
    from data where status not like "%retracted%" or status is NULL'

There are more flags that you may want to filter out (suboptimal 
structures, duplicates, structures without coordinates, structures with 
warnings, etc.); please check our Wiki from the COD Web page for full 
documentation.
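
If the status values have already been fetched into a program, the same
filter can be applied there. The following Python sketch mirrors the
condition of the SQL query above; it assumes that each row is a
(file, status) pair with None standing for SQL NULL, and the sample rows
and status strings are made up for illustration. Lower-casing the status
approximates a case-insensitive LIKE, which is an assumption about the
collation of the database.

```python
from typing import Iterable, List, Optional, Tuple


def keep_non_retracted(rows: Iterable[Tuple[str, Optional[str]]]) -> List[str]:
    """Keep files whose status is NULL or does not mention 'retracted'."""
    kept = []
    for file_id, status in rows:
        if status is None or "retracted" not in status.lower():
            kept.append(file_id)
    return kept


# Made-up sample rows; only the retracted state of 2217080 comes from
# the discussion above, the other values are purely illustrative.
sample = [
    ("2217080", "retracted"),
    ("1000195", None),
    ("5900030", "warnings"),
]
print(keep_non_retracted(sample))
```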

- 1000195: this entry contains '?' as coordinates, indicating that they
are unknown. Looking at the publication year (1962), I realise that this
is a very old publication; we do not have the paper at hand, and it is
also likely that at that time coordinates were not reported for some
compounds, only cell parameters.

COD entries of this kind are provided to indicate that the
publication exists, and to provide the information currently known
(cell parameters, chemical composition, crystal symmetry). This
information is already enough for some kinds of computations (e.g. as
initial approximations for DFT).

If we ever get the original publication and the coordinates are
published there, we will insert them in a new revision of this entry.
If you have access to the original publication, we would be grateful if
you shared it (or the updated CIF ;) with us.

- 5900030: in this entry, the x coordinate has values '.' since these 
values were not determined in the original publication; while physically 
the x coordinate is defined for the structure, it is not available from 
this particular publication (i.e. we have no chance to recover it from 
published data). Other data values, such as cell constants and the y-z 
coordinates of the projection are available and can be used.

> - ambiguous site labelling

I am not quite sure what problem you mean here. One known issue is that
some structures do have duplicate atom labels. This is an error, and we
will fix it in time. This involves a fair amount of manual checking,
however, so I cannot promise that we will do it quickly.

For the moment, a possible workaround would be to add a unique suffix to
such atom labels during the structure interpretation and then process
the structure as usual.
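
A minimal Python sketch of this suffix workaround follows. It assumes
that the labels of the coordinate loop are available as a list in file
order; the function name and the "_1", "_2" suffix format are arbitrary
choices for illustration.

```python
from collections import Counter
from typing import List


def make_labels_unique(labels: List[str]) -> List[str]:
    """Append a numeric suffix to every label that occurs more than once."""
    totals = Counter(labels)
    used = set(labels)   # names that a new suffixed label must not collide with
    counters = Counter()
    result = []
    for label in labels:
        if totals[label] == 1:
            result.append(label)
            continue
        # Advance the counter until the suffixed name is unused.
        while True:
            counters[label] += 1
            candidate = f"{label}_{counters[label]}"
            if candidate not in used:
                break
        used.add(candidate)
        result.append(candidate)
    return result


print(make_labels_unique(["O1", "O1", "Si1"]))
```

Note that the renaming is only safe for the coordinate loop itself; if
another loop refers to a duplicated label, it remains ambiguous which
atom is meant.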

> - invalid element symbols

This is a known issue, especially with atoms from AMCSD, which has a
custom labelling scheme.

Fortunately, the new version of AMCSD has a new, consistent atom naming
scheme, and we could assign atom types semi-automatically for these
entries. Incidentally, I have just finished the analysis and assignment
of atom types for those entries.

Please check out the COD revision 285101 – it should have most of the 
atoms with the correct types assigned. As per my checks, only 45 COD 
entries remain that still have unrecognised atom types (if you take 
_atom_site_type_label into account, of course). Some of these are indeed 
unknown atoms, such as metal sites with uncertain identity.

Please let us know how this revision scores with your software!

>
> A common issue is a mismatch between site labels in different data 
> blocks (e.g., a table of anisotropic displacement parameters and a 
> table of fractional coordinates).

Just a bit of nit-picking on terminology – all COD files contain just
one data block (it starts with a unique data_... header in each CIF).
ADPs and coordinates are usually located in different /loops/ in the
same data block.

> We found these errors in numerous files submitted via the *American 
> Mineralogist crystal structures database* (clearly, substantial 
> amounts of U.S. governmental funding failed to prevent basic 
> transcription errors!)

In all fairness, I would say that Bob Downs and his team do a good job
collecting all minerals; without the AMCSD contribution, our COD
collection of minerals would have been much shabbier. They are
constantly improving their collection (I'm constantly in touch with Bob
on these matters), and their recent work enabled us to assign atom types
with reasonable effort. As for the funding, I'm not sure if they get
substantial amounts of it; I am aware of several startup grants they
had, and I think they used them as well as they could.

This does not mean that matters cannot be improved :), and we are
working on that as well. The discrepancy between the labels in the Uij
and xyz loops is a known issue that appeared in the recent update. We
are working with Bob to rectify this, but this will take a while. In
the meantime, I suggest a workaround below:

>
> Take the following file, 9003355, as an example:-
>
> • Sites SiT1’, AlT1’ (etc.) are listed in the loop containing Uij
> • The same site are labelling differently (e.g., SiT1*, AlT1*, etc.) 
> in the loop containing xyz
>
> Whilst, to a human, one could make inferences as to how these labels 
> should be related, a computer cannot make such a judgement, thereby 
> rendering these files useless.

I agree that humans can match the labels, and potentially fix them;
however, we have no manpower to go through these lists manually, and
even then the manual editing would be error-prone. We could apply a
heuristic that an apostrophe ("'") in one loop corresponds to the
asterisk ("*") in the other loop and make an automatic correction, but
the results would still need to be checked manually (I am reluctant to
commit to the COD changes that are based on broad guesses); also, there
are some other patterns in place (e.g. the 'OH' vs 'O-H' change in
labels).

From the error messages in the log file that you sent us, I have the
impression that your program looks for an atom label in the
_atom_site_aniso_label (aka Uij) loop, and then tries to find the
corresponding _atom_site_label in the coordinate loop. This will fail
not only when the labels do not match but also when the atom is not
mentioned in the _atom_site_aniso_label loop /at all/. Since not all
atoms are refined anisotropically, some of them can legitimately be left
out of the Uij loop while still being present in the _atom_site_fract_x
loop; such files are perfectly valid and usable.

May I suggest a workaround for the processing of such files: look
first in the coordinate loop for the _atom_site_label to identify
all atoms, and then look up the anisotropic displacement parameters Uij
in the _atom_site_aniso_label loop if they exist. If they do not, it is
often possible to use Uiso instead, and I bet this will be a fair
approximation even for anisotropically refined atoms. In this way you
will correctly process all correct files and have reasonable
approximate data for the files that are currently mislabelled. In the
future we will correct the Uij<->xyz label correspondence (our validator
detects such mismatches), and you can then recalculate your outputs with
the new COD revision, getting more accurate results. I can let you know
when such a revision is issued in the COD, but please ping me after some
time since I can forget :)
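
The lookup order suggested above can be sketched in Python as follows.
This is only an illustration under stated assumptions: both loops are
taken to be already parsed into lists of dictionaries keyed by CIF data
names (the data names are those of the CIF core dictionary; the row
layout and the function names are hypothetical), and the Uiso value is
simply carried along as a fallback rather than converted into a tensor.

```python
from typing import Dict, List, Optional

ANISO_TAGS = [
    f"_atom_site_aniso_U_{ij}" for ij in ("11", "22", "33", "12", "13", "23")
]


def to_number(raw: str) -> Optional[float]:
    """Parse a numeric token, dropping any '(su)'; None for '?' and '.'."""
    raw = raw.strip()
    if raw in ("?", "."):
        return None
    return float(raw.split("(")[0])


def attach_displacements(
    atom_rows: List[Dict[str, str]], aniso_rows: List[Dict[str, str]]
) -> List[dict]:
    """Start from the coordinate loop; use Uij if found, otherwise Uiso."""
    aniso_by_label = {r["_atom_site_aniso_label"]: r for r in aniso_rows}
    atoms = []
    for row in atom_rows:
        label = row["_atom_site_label"]
        atom = {"label": label, "uij": None, "uiso": None}
        aniso = aniso_by_label.get(label)
        if aniso is not None:
            values = [to_number(aniso.get(tag, "?")) for tag in ANISO_TAGS]
            if all(v is not None for v in values):
                atom["uij"] = tuple(values)
        if atom["uij"] is None:
            # No (complete) Uij row with this label: fall back to Uiso.
            atom["uiso"] = to_number(row.get("_atom_site_U_iso_or_equiv", "?"))
        atoms.append(atom)
    return atoms
```

Every atom of the coordinate loop is returned; an atom whose label has
no counterpart in the Uij loop simply ends up with "uij" set to None and
whatever Uiso value the file provides.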

Of course one can also apply the heuristic mentioned above, or skip
entries with such mismatches altogether, until the new COD revision is
in place.
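
For completeness, here is a minimal Python sketch of such a label
heuristic. It treats an asterisk as equivalent to an apostrophe and
ignores hyphens, which covers the two patterns mentioned in this
message; these rules are guesses, the function names are hypothetical,
and any match obtained this way should be regarded as unverified. Labels
without a unique counterpart are left unmatched so that the calling
program can skip them.

```python
from typing import Dict, List


def normalise_label(label: str) -> str:
    """Map "*" to "'" and drop "-" (e.g. "SiT1*" -> "SiT1'", "O-H" -> "OH")."""
    return label.replace("*", "'").replace("-", "")


def match_aniso_labels(
    site_labels: List[str], aniso_labels: List[str]
) -> Dict[str, str]:
    """Map Uij-loop labels to coordinate-loop labels where unambiguous."""
    exact = set(site_labels)
    by_norm: Dict[str, List[str]] = {}
    for label in site_labels:
        by_norm.setdefault(normalise_label(label), []).append(label)

    mapping = {}
    for aniso in aniso_labels:
        if aniso in exact:
            mapping[aniso] = aniso          # exact match, no guessing needed
            continue
        candidates = by_norm.get(normalise_label(aniso), [])
        if len(candidates) == 1:            # accept only a unique guess
            mapping[aniso] = candidates[0]
    return mapping


sites = ["SiT1*", "AlT1*", "O-H1", "O1"]
aniso = ["SiT1'", "AlT1'", "OH1", "O1", "X9"]
print(match_aniso_labels(sites, aniso))
```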

I hope this clarifies the COD data contents and the way we address the
detected problems.

Once more, thank you for your report!

>
> I hope this helps, and do let me know if you have any questions.
>
> With best wishes,
> Yours faithfully,
>
> David Palmer
>
> David C Palmer, Ph.D. (Cantab), M.A. (Cantab),
> Managing Director, CrystalMaker Software Ltd
> Centre for Innovation & Enterprise |  Oxford University Begbroke 
> Science Park
> Woodstock Road, Begbroke, Oxfordshire, OX5 1PF, UK
>
Sincerely yours,
Saulius

References:

[1] Merkys, A.; Vaitkus, A.; Butkus, J.; Okulič-Kazarinas, M.; Kairys,
V. & Gražulis, S. COD::CIF::Parser: an error-correcting CIF parser for
the Perl language. Journal of Applied Crystallography, 2016, 49,
292-301. DOI: https://doi.org/10.1107/S1600576715022396

-- 
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366
