[Cod-bugs] I found some errors in CIFs
Saulius Gražulis
grazulis at ibt.lt
Wed Sep 14 10:51:29 EEST 2022
Dear Matthew!
Thank you for reporting errors in the COD database, and sorry for not
responding immediately. It took us some time to look into the issues.
The problems are indeed serious, and I'll describe how to deal with
them in the letter below:
On 2022-08-13 12:55, Matthew Rowles wrote:
> Hi COD people
>
> I was playing around with some ideas this week, and when parsing all
> the CIFs, I got the following errors from my program. They will all
> relate to errors in how the values of atom coordinates, atomic
> displacement parameters, or cell parameters are given.
> I've looked at most of them, and they're fairly obvious in their
> error. In all cases, this is just the first parsing error encountered;
> in most files there are more than one.
>
> What is the best way to go about fixing them?
>
The problems that you list should be fixed whenever possible. Actually,
there is an ongoing effort to use the CIF validator to find problems of
this kind and many others; the CIF validator [1] is run on the whole COD
database, and issues are recorded in separate DB tables [2]. The problem
is that there are over 11 million (!) validation messages; we are still
in the process of triaging them and setting up the data curation queue.
In this context your feedback is extremely helpful, since you tell us
what is important for you as a COD user, and we can now give these
issues high priority!
Fixing COD records.
----------------------------
When fixing COD records in the process of data curation, we stick to one
main principle: "do not invent data". This basically means that we (as a
community ;) should only change the COD data record if there is positive
evidence that the author(s) of the structure indeed meant what we are
inserting into the COD CIF. So, for instance, we would only change the
value '.07.9' in the COD 2010104 entry to '0.079' if the original paper
mentions it somewhere in the text, or if there are derived data (say bond
lengths) that can only be correct if the number in question is '0.079'.
Otherwise we do not know if the number was '7.9' or '0.07(9)' or
something like that. Likewise, in the entry 4065647, the number
'0.11152)' could be actually '0.1115(2)' or '0.111(52)', and we do not
know which without further investigation.
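The same caution can be built into parsing software. The following is a
minimal Python sketch, not the COD's own tooling: it accepts only values
in the usual CIF numeric form (optional sign, digits with at most one
decimal point, optional exponent, optional standard uncertainty in
parentheses) and rejects everything else instead of guessing. The name
parse_cif_number and the regular expression are illustrative
assumptions; the special CIF values '.' and '?' are not handled here.

```python
import re

# Number with optional exponent and optional standard uncertainty,
# e.g. "0.1115(2)". Illustrative pattern, not the official CIF grammar.
_CIF_NUMBER = re.compile(
    r"([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)(?:\((\d+)\))?"
)


def parse_cif_number(token):
    """Return (value, su_digits) for a well-formed token, or raise ValueError.

    su_digits is the uncertainty exactly as written inside the parentheses
    (or None); it is not rescaled to the last decimal places here.
    """
    match = _CIF_NUMBER.fullmatch(token)
    if match is None:
        # Refuse to guess: '.07.9' or '0.11152)' has no unique reading.
        raise ValueError(f"malformed CIF number: {token!r}")
    value, su_digits = match.groups()
    return float(value), su_digits
```

With this sketch a token such as '0.1115(2)' is accepted, while '.07.9'
and '0.11152)' raise ValueError and are left for manual investigation.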
All changes are documented within the corresponding COD CIF. The older
data item for this was _cod_depositor_comments, whose value is a
free-format text describing the changes and issues; more recently we
are switching to more structured comments in the _cod_entry_issue_... and
_cod_changelog_entry_... loops.
As an example, I can show two entries from your list which I have fixed
recently [3,4]. I have inspected the supplementary file provided with
the original publication and found a comment that said: "Starred atoms
were refined isotropically" next to the atom coordinate loops. As there
were indeed no U_ij data items for the "starred" atoms, I concluded that
we can simply remove the '*' characters from the values to make the
files correct CIFs. The information about isotropic refinement is not lost
since it can be inferred from the missing U_ij values, and is also
documented in the COD Subversion revision history and in the file itself.
The downside of this policy is that it takes time and we cannot promise
fast fixes (it took me about an hour to investigate and fix these two
entries [3,4]). The good thing is that the COD will never mislead the
users and distort the original authors' claims if we strictly follow the
above rules. Since COD is intended for scientific research we think that
correct representation of the scientific data is the ultimate priority.
So essentially there are the following two ways to deal with the issues
you have found:
a) you let us fix the issues at our (slow) pace; eventually, I hope,
the fixes will end up in the main COD repository;
b) you fix the COD records using the above principles and send them to
us; we check them and commit them to the COD repository (under your name,
if you wish to get credit ;). Please make sure that you have
corroborating evidence for the changes in the comments section; you can
either use the _cod_entry_issue_... and _cod_changelog_entry_... loops as
in [3,4] (which is preferable) or, if that seems too complicated, you
can use the older _cod_depositor_comments tag to describe your changes
as in the older 7026981 example [5].
If there is no evidence in the original publication as to what the
correct values are, we /should not/ fix them (unless we contact the
author of the structure and he or she confirms the change). In that
case, we can only mark the COD entry with the _cod_entry_issue_... data
elements and set the severity to 'warning' or 'error'. Your software
should then skip such entries if it cannot process them.
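Skipping entries that cannot be processed might look like the sketch
below. It is only an illustration under stated assumptions: the numeric
tokens of each entry are taken to be already extracted, and the names
convert_or_skip and entries are hypothetical. It strips a well-formed
trailing uncertainty such as '(2)' and then relies on Python's built-in
float(), which raises ValueError for values like those in your list.

```python
import re

# Well-formed trailing standard uncertainty, e.g. the "(2)" in "0.1115(2)".
_SU_SUFFIX = re.compile(r"\(\d+\)$")


def convert_or_skip(entries):
    """Split entries into converted values and skipped entries with reasons.

    `entries` maps an entry identifier to its list of numeric tokens
    (hypothetical input; extracting the tokens is not shown here).
    """
    converted, skipped = {}, {}
    for entry_id, tokens in entries.items():
        try:
            converted[entry_id] = [
                float(_SU_SUFFIX.sub("", token)) for token in tokens
            ]
        except ValueError as error:
            # Do not guess the intended value; record why it was skipped.
            skipped[entry_id] = str(error)
    return converted, skipped
```

Keeping the skipped identifiers together with the error message gives a
record of which entries were left out of a computation and why.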
For your computations, you can apply local fixes (e.g. in the form of
patches, as used by the Unix 'patch' command) immediately before
processing; or you can ignore COD records with errors (which you
probably do ;). In both cases your procedures and workflow will be
properly documented, so I do not see any problems with scientific
reproducibility if you use patched COD records on your side, provided
the patch files are recorded in your workflow.
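One way to record such a local fix is as a unified diff, which is the
format read by the Unix 'patch' command. The sketch below uses Python's
standard difflib module to produce one from an original and a locally
corrected text. The function name make_patch, the file labels and the
two-line sample are placeholders, and the replacement value only stands
in for a correction that you have evidence for.

```python
import difflib


def make_patch(original_text, fixed_text, name):
    """Return a unified diff that turns original_text into fixed_text."""
    diff = difflib.unified_diff(
        original_text.splitlines(keepends=True),
        fixed_text.splitlines(keepends=True),
        fromfile=f"{name}.orig",
        tofile=name,
    )
    return "".join(diff)


# Made-up two-line sample, not the content of any real COD entry.
original = "X1 0.25 0.50 .07.9\nX2 0.10 0.20 0.030\n"
fixed = original.replace(".07.9", "0.079")
patch_text = make_patch(original, fixed, "sample.cif")
```

Writing patch_text to a file that is kept next to the analysis scripts
documents exactly what was changed before processing.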
I hope this clarifies the situation with buggy COD records.
Unfortunately we still do have them – such was the state of the art in
scientific data publishing when these data files were produced...
Sincerely yours,
Saulius
> Regards
>
> Matthew
>
> file msg
> cif\7026981.cif could not convert string to float: '6.5*'
> cif\2003944.cif could not convert string to float: 'O'
> cif\2002897.cif could not convert string to float: '-'
> cif\7027704.cif could not convert string to float: '0..049'
> cif\7217107.cif could not convert string to float: '-'
> cif\1516392.cif could not convert string to float: '0..0498'
> cif\2001164.cif could not convert string to float: '.093.'
> cif\2010104.cif could not convert string to float: '.07.9'
> cif\4115528.cif could not convert string to float: '0..076'
> cif\2001895.cif could not convert string to float: '.0.828'
> cif\2003945.cif could not convert string to float: 'O'
> cif\2007662.cif could not convert string to float: '0.0634*'
> cif\2000499.cif could not convert string to float: '.0479{6)'
> cif\4065647.cif could not convert string to float: '0.11152)'
> cif\4321917.cif could not convert string to float: '0.0)'
> cif\4112884.cif could not convert string to float: '0..0460'
> cif\2101447.cif could not convert string to float: '..114'
> cif\7012794.cif could not convert string to float: '.0.053'
> cif\4321125.cif could not convert string to float: ':0.0321'
> cif\7026982.cif could not convert string to float: '6*'
> cif\2006442.cif could not convert string to float: '0..0770'
> cif\2010442.cif could not convert string to float: '0.08*'
> cif\7009740.cif could not convert string to float: '0..0307'
> cif\2003869.cif could not convert string to float: '.0458^a^'
> cif\2003381.cif could not convert string to float: '6*'
> cif\7051472.cif could not convert string to float: '0..066'
> cif\2101014.cif could not convert string to float: '.o39'
> cif\2003699.cif could not convert string to float: '4.0*'
> cif\2206732.cif could not convert string to float: '0.029m6'
> cif\2001302.cif could not convert string to float: '0.047*'
> cif\2006197.cif could not convert string to float: '0..076'
> cif\4085045.cif could not convert string to float: '0..059'
> cif\2007409.cif could not convert string to float: '0..0535'
> cif\2009533.cif could not convert string to float: '/'
> cif\2007301.cif could not convert string to float: ''
> cif\4320747.cif could not convert string to float: 'H'
> cif\4320905.cif could not convert string to float: '0..0629'
> cif\2007048.cif could not convert string to float: '0.0393*'
> cif\4322860.cif could not convert string to float: '0..055'
> cif\2003192.cif could not convert string to float: '0.08*'
> cif\4322021.cif could not convert string to float: '0..073'
> cif\2006384.cif could not convert string to float: '0..131'
> cif\4315383.cif could not convert string to float: '0.0.99'
> cif\4300081.cif could not convert string to float: '0.l09'
> cif\7027965.cif could not convert string to float: '0..037'
> cif\4061854.cif could not convert string to float: '0._61'
> cif\4300051.cif could not convert string to float: '0..0565'
> cif\2005964.cif could not convert string to float: '4.4*'
> cif\2009883.cif could not convert string to float: '.046.5'
> cif\2007654.cif could not convert string to float: '0.0494*'
> cif\4323741.cif could not convert string to float: '0..0283'
> cif\4322390.cif could not convert string to float: '0..115'
> cif\7111128.cif could not convert string to float: '0..091'
> cif\2005961.cif could not convert string to float: "C1'"
> cif\4333084.cif could not convert string to float: '0.09817)'
> cif\2009384.cif could not convert string to float: 'H91'
> cif\4320033.cif could not convert string to float: '..0794'
> cif\7027705.cif could not convert string to float: '0..056'
> cif\2002898.cif could not convert string to float: '-'
> cif\4322414.cif could not convert string to float: '0..0266'
> cif\3500007.cif could not convert string to float: '0.O1D'
> cif\2007253.cif could not convert string to float: '0..0684'
> cif\2005907.cif could not convert string to float: '0.0962*'
> cif\4115907.cif could not convert string to float: '0..0486'
> cif\4322396.cif could not convert string to float: '0..042'
> cif\7009739.cif could not convert string to float: '0..0477'
> cif\2101312.cif could not convert string to float: '6.0*'
> cif\7702612.cif could not convert string to float: '16*'
References:
[1] Vaitkus, A.; Merkys, A. & Gražulis, S. Validation of the
Crystallography Open Database using the Crystallographic Information
Framework. Journal of Applied Crystallography, International Union of
Crystallography (IUCr), 2021, 54, 1-12 DOI:
https://doi.org/10.1107/s1600576720016532
[2] Vaitkus, A. COD validation issue database. 2021, URL:
http://sql.crystallography.net/db/cod_validation [accessed
2022-09-14T10:07+03:00]. NOTE: the page is slow to load, please be patient!
[3] http://www.crystallography.net/cod/7026981.html [accessed
2022-09-14T10:18+03:00]
[4] http://www.crystallography.net/cod/7026982.html [accessed
2022-09-14T10:18+03:00]
[5] Older revision of the COD 7026981 entry, showing the use of the
_cod_depositor_comments data item. URL:
http://www.crystallography.net/cod/7026981.cif@277790 [accessed
2022-09-14T10:35+03:00]
--
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
mobile: (+370-684)-49802, (+370-614)-36366