[Cod-bugs] I found some errors in CIFs

Saulius Gražulis grazulis at ibt.lt
Wed Sep 14 10:51:29 EEST 2022


Dear Matthew!

Thank you for reporting errors in the COD database, and sorry for not 
responding immediately. It took us some time to look into the issues.

The problems are indeed serious, and I describe how to deal with them 
in the letter below:

On 2022-08-13 12:55, Matthew Rowles wrote:
> Hi COD people
>
> I was playing around with some ideas this week, and when parsing all 
> the CIFs, I got the following errors from my program. They will all 
> relate to errors in how the values of atom coordinates, atomic 
> displacement parameters, or cell parameters are given.
> I've looked at most of them, and they're fairly obvious in their 
> error. In all cases, this is just the first parsing error encountered; 
> in most files there are more than one.
>
> What is the best way to go about fixing them?
>
The problems that you list should be fixed whenever possible. Actually, 
there is an ongoing effort to use a CIF validator to find problems of 
this kind and many others; the CIF validator [1] is run on the whole COD 
database, and issues are recorded in separate DB tables [2]. The problem 
is that there are over 11 million (!) validation messages; we are still 
in the process of triaging them and setting up the data curation queue.

In this context your feedback is extremely helpful, since you tell us 
what is important for you as a COD user, and we can now give these 
issues high priority!

Fixing COD records
------------------

When fixing COD records in the process of data curation, we stick to one 
main principle: "do not invent data". This basically means that we (as a 
community ;) should only change the COD data record if there is positive 
evidence that the author(s) of the structure indeed meant what we are 
inserting into the COD CIF. So, for instance, we would only change the 
value '.07.9' in the COD 2010104 entry to '0.079' if the original paper 
gives this value somewhere in the text, or if there are derived data 
(say, bond lengths) that can only be correct if the number in question 
is '0.079'. 
Otherwise we do not know if the number was '7.9' or '0.07(9)' or 
something like that. Likewise, in the entry 4065647, the number 
'0.11152)' could be actually '0.1115(2)' or '0.111(52)', and we do not 
know which without further investigation.
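
To make the ambiguity concrete, here is a minimal Python sketch (not the 
COD validator [1]) that accepts only well-formed CIF numeric values and 
rejects everything else without guessing. The name parse_cif_number and 
the regular expression are assumptions made for this illustration; the 
standard uncertainty is returned as the raw digit string rather than 
being scaled.

```python
import re

# Assumed pattern for a CIF number: mantissa, optional exponent,
# optional standard uncertainty (s.u.) in parentheses.
_CIF_NUMBER = re.compile(
    r"(?P<mantissa>[+-]?(?:\d+\.?\d*|\.\d+))"
    r"(?P<exponent>[eE][+-]?\d+)?"
    r"(?:\((?P<su>\d+)\))?"
)


def parse_cif_number(text):
    """Return (value, su_digits) for a well-formed CIF number.

    The CIF special values '.' and '?' yield (None, None).
    Anything else that is not well formed raises ValueError; no
    attempt is made to guess what the author meant.
    """
    if text in (".", "?"):
        return None, None
    match = _CIF_NUMBER.fullmatch(text)
    if match is None:
        raise ValueError(f"malformed CIF numeric value: {text!r}")
    value = float(match.group("mantissa") + (match.group("exponent") or ""))
    return value, match.group("su")


# Both candidate readings parse; the value actually found in the file does not.
for raw in ["0.1115(2)", "0.111(52)", "0.11152)", ".07.9"]:
    try:
        print(raw, "->", parse_cif_number(raw))
    except ValueError as err:
        print(raw, "-> rejected:", err)
```

A parser of this kind can only report that '0.11152)' is malformed; 
choosing between the two readings still needs evidence from the paper.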

All changes are documented within the corresponding COD CIF. The older 
approach used the _cod_depositor_comments data item, whose free-format 
text value describes the changes and issues; more recently we have been 
switching to more structured comments in the _cod_entry_issue_... and 
_cod_changelog_entry_... loops.

As an example, I can show two entries from your list which I have fixed 
recently [3,4]. I have inspected the supplementary file provided with 
the original publication and found a comment that said: "Starred atoms 
were refined isotropically" next to the atom coordinate loops. As there 
were indeed no U_ij data items for the "starred" atoms, I concluded that 
we can simply remove the '*' characters from the values to make them 
correct CIFs. The information about isotropic refinement is not lost 
since it can be inferred from the missing U_ij values, and is also 
documented in the COD Subversion revision history and in the file itself.
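
As a rough Python sketch of that clean-up step: the helper below removes 
a single trailing '*' only when the rest of the value is a well-formed 
number, and leaves anything else untouched. The name unstar and the 
pattern are assumptions for illustration, not the procedure actually 
used for [3,4]; such a step is only justified for entries where the 
source explains what the star means.

```python
import re

# A number (optionally with an s.u. in parentheses) followed by one '*'.
_STARRED = re.compile(r"([+-]?(?:\d+\.?\d*|\.\d+)(?:\(\d+\))?)\*")


def unstar(value):
    """Return (cleaned_value, was_starred) for one data value."""
    match = _STARRED.fullmatch(value)
    if match is None:
        # Not of the form '<number>*': leave the value exactly as it is.
        return value, False
    return match.group(1), True


# '6.5*' and '6*' are the values reported for 7026981 and 7026982.
for raw in ["6.5*", "6*", "0.l09"]:
    print(raw, "->", unstar(raw))
```

Returning the was_starred flag makes it easy to list every changed 
value in the entry's change log.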

The downside of this policy is that it takes time and we cannot promise 
fast fixes (it took me about an hour to investigate and fix these two 
entries [3,4]). The good thing is that the COD will never mislead the 
users and distort the original authors' claims if we strictly follow the 
above rules. Since COD is intended for scientific research we think that 
correct representation of the scientific data is the ultimate priority.

So essentially there are the following two ways to deal with the issues 
you have found:

a) you let us fix the issues at our (slow) pace; eventually, I hope, 
the fixes will end up in the main COD repository;

b) you fix the COD records using the above principles and send them to 
us; we check them and commit them to the COD repository (under your name, 
if you wish to get credit ;). Please make sure that you record the 
corroborating evidence for the changes in the comments section; you can 
either use the _cod_entry_issue_... and _cod_changelog_entry_... loops as 
in [3,4] (which is preferable) or, if that seems too complicated, you 
can use the older _cod_depositor_comments tag to describe your changes 
as in the older 7026981 example [5].

If there is no evidence in the original publication as to what correct 
values should be inserted into the text, we /should not/ fix them 
(unless we contact the author of the structure and he or she 
confirms the change). In that case, we can only mark the COD entry with 
the _cod_entry_issue_... data elements and set severity to 'warning' or 
'error'. Your software should then skip such entries if it cannot 
process them.

For your computations, you can apply local fixes (e.g. in the form of 
patches, as used by the Unix 'patch' command) immediately before 
processing; or you can ignore COD records with errors (which you 
probably do ;). In both cases your procedures and workflow will be 
properly documented, so I do not see any problems with scientific 
reproducibility if you use patched COD records on your side, provided 
the patch files are recorded in your workflow.
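
One possible way to keep such local fixes documented is sketched below 
in Python. It uses exact string substitution from a recorded table 
instead of the Unix 'patch' command, purely to keep the example 
self-contained. The names LocalFix and apply_local_fixes are 
hypothetical, and the atom line in the demo is invented for 
illustration, not copied from the COD file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalFix:
    cod_id: str    # COD entry the fix belongs to
    old: str       # exact text as found in the COD file
    new: str       # replacement used only in the local workflow
    evidence: str  # why this is believed to be what the authors meant


def apply_local_fixes(cod_id, cif_text, fixes):
    """Apply the fixes recorded for one entry and return (text, applied).

    Each 'old' string must occur exactly once, so a fix can never
    silently change the wrong value or be applied to a changed file.
    """
    applied = []
    for fix in fixes:
        if fix.cod_id != cod_id:
            continue
        count = cif_text.count(fix.old)
        if count != 1:
            raise ValueError(
                f"{cod_id}: expected exactly one {fix.old!r}, found {count}"
            )
        cif_text = cif_text.replace(fix.old, fix.new)
        applied.append(fix)
    return cif_text, applied


fixes = [
    LocalFix(
        cod_id="7026981",
        old="6.5*",
        new="6.5",
        evidence="supplementary file: 'Starred atoms were refined isotropically'",
    ),
]

# Hypothetical one-line fragment standing in for the real file contents.
fragment = "C1 0.1000 0.2000 0.3000 6.5*\n"
patched, applied = apply_local_fixes("7026981", fragment, fixes)
print(patched, end="")
print([fix.evidence for fix in applied])
```

Keeping the table of fixes under version control next to the analysis 
code gives the same reproducibility as recorded patch files.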

I hope this clarifies the situation with buggy COD records. 
Unfortunately we still do have them – such was the state of the art in 
scientific data publishing when these data files were produced...

Sincerely yours,
Saulius

> Regards
>
> Matthew
>
> file 	msg
> cif\7026981.cif 	could not convert string to float: '6.5*'
> cif\2003944.cif 	could not convert string to float: 'O'
> cif\2002897.cif 	could not convert string to float: '-'
> cif\7027704.cif 	could not convert string to float: '0..049'
> cif\7217107.cif 	could not convert string to float: '-'
> cif\1516392.cif 	could not convert string to float: '0..0498'
> cif\2001164.cif 	could not convert string to float: '.093.'
> cif\2010104.cif 	could not convert string to float: '.07.9'
> cif\4115528.cif 	could not convert string to float: '0..076'
> cif\2001895.cif 	could not convert string to float: '.0.828'
> cif\2003945.cif 	could not convert string to float: 'O'
> cif\2007662.cif 	could not convert string to float: '0.0634*'
> cif\2000499.cif 	could not convert string to float: '.0479{6)'
> cif\4065647.cif 	could not convert string to float: '0.11152)'
> cif\4321917.cif 	could not convert string to float: '0.0)'
> cif\4112884.cif 	could not convert string to float: '0..0460'
> cif\2101447.cif 	could not convert string to float: '..114'
> cif\7012794.cif 	could not convert string to float: '.0.053'
> cif\4321125.cif 	could not convert string to float: ':0.0321'
> cif\7026982.cif 	could not convert string to float: '6*'
> cif\2006442.cif 	could not convert string to float: '0..0770'
> cif\2010442.cif 	could not convert string to float: '0.08*'
> cif\7009740.cif 	could not convert string to float: '0..0307'
> cif\2003869.cif 	could not convert string to float: '.0458^a^'
> cif\2003381.cif 	could not convert string to float: '6*'
> cif\7051472.cif 	could not convert string to float: '0..066'
> cif\2101014.cif 	could not convert string to float: '.o39'
> cif\2003699.cif 	could not convert string to float: '4.0*'
> cif\2206732.cif 	could not convert string to float: '0.029m6'
> cif\2001302.cif 	could not convert string to float: '0.047*'
> cif\2006197.cif 	could not convert string to float: '0..076'
> cif\4085045.cif 	could not convert string to float: '0..059'
> cif\2007409.cif 	could not convert string to float: '0..0535'
> cif\2009533.cif 	could not convert string to float: '/'
> cif\2007301.cif 	could not convert string to float: ''
> cif\4320747.cif 	could not convert string to float: 'H'
> cif\4320905.cif 	could not convert string to float: '0..0629'
> cif\2007048.cif 	could not convert string to float: '0.0393*'
> cif\4322860.cif 	could not convert string to float: '0..055'
> cif\2003192.cif 	could not convert string to float: '0.08*'
> cif\4322021.cif 	could not convert string to float: '0..073'
> cif\2006384.cif 	could not convert string to float: '0..131'
> cif\4315383.cif 	could not convert string to float: '0.0.99'
> cif\4300081.cif 	could not convert string to float: '0.l09'
> cif\7027965.cif 	could not convert string to float: '0..037'
> cif\4061854.cif 	could not convert string to float: '0._61'
> cif\4300051.cif 	could not convert string to float: '0..0565'
> cif\2005964.cif 	could not convert string to float: '4.4*'
> cif\2009883.cif 	could not convert string to float: '.046.5'
> cif\2007654.cif 	could not convert string to float: '0.0494*'
> cif\4323741.cif 	could not convert string to float: '0..0283'
> cif\4322390.cif 	could not convert string to float: '0..115'
> cif\7111128.cif 	could not convert string to float: '0..091'
> cif\2005961.cif 	could not convert string to float: "C1'"
> cif\4333084.cif 	could not convert string to float: '0.09817)'
> cif\2009384.cif 	could not convert string to float: 'H91'
> cif\4320033.cif 	could not convert string to float: '..0794'
> cif\7027705.cif 	could not convert string to float: '0..056'
> cif\2002898.cif 	could not convert string to float: '-'
> cif\4322414.cif 	could not convert string to float: '0..0266'
> cif\3500007.cif 	could not convert string to float: '0.O1D'
> cif\2007253.cif 	could not convert string to float: '0..0684'
> cif\2005907.cif 	could not convert string to float: '0.0962*'
> cif\4115907.cif 	could not convert string to float: '0..0486'
> cif\4322396.cif 	could not convert string to float: '0..042'
> cif\7009739.cif 	could not convert string to float: '0..0477'
> cif\2101312.cif 	could not convert string to float: '6.0*'
> cif\7702612.cif 	could not convert string to float: '16*'

References:

[1] Vaitkus, A.; Merkys, A. & Gražulis, S. Validation of the 
Crystallography Open Database using the Crystallographic Information 
Framework. Journal of Applied Crystallography, International Union of 
Crystallography (IUCr), 2021, 54, 1-12 DOI: 
https://doi.org/10.1107/s1600576720016532

[2] Vaitkus, A. COD validation issue database. 2021, URL: 
http://sql.crystallography.net/db/cod_validation [accessed 
2022-09-14T10:07+03:00]. NOTE: the page is slow to load, please be patient!

[3] http://www.crystallography.net/cod/7026981.html [accessed 
2022-09-14T10:18+03:00]

[4] http://www.crystallography.net/cod/7026982.html [accessed 
2022-09-14T10:18+03:00]

[5] Older revision of the COD 7026981 entry, showing the use of the 
_cod_depositor_comments data item. URL: 
http://www.crystallography.net/cod/7026981.cif@277790 [accessed 
2022-09-14T10:35+03:00]

-- 
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
mobile: (+370-684)-49802, (+370-614)-36366
