[Cod-bugs] I found some errors in CIFs
Matthew Rowles
rowlesmr at gmail.com
Wed Sep 14 12:08:30 EEST 2022
Thanks for the reply Saulius
Obviously, in my naivety, "obvious" isn't really the case. :)
In my use-case, if it didn't parse, I just ignored it. The number of
non-parsing files vs parsing was such that it is essentially insignificant,
so there's that vote in your favour!
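
For the record, a minimal sketch of that skip-on-failure approach. The names `triage` and `parse` are illustrative assumptions, not the actual program; `parse` stands for any callable that raises ValueError on a value it cannot read.

```python
def triage(paths, parse):
    """Parse every entry; keep the results and record why the rest were skipped.

    `parse` is a hypothetical callable supplied by the caller. It must raise
    ValueError when an entry cannot be parsed.
    """
    parsed, skipped = {}, {}
    for path in paths:
        try:
            parsed[path] = parse(path)
        except ValueError as err:
            # Keep the message so that skipped entries stay documented
            # instead of disappearing silently.
            skipped[path] = str(err)
    return parsed, skipped
```

Keeping the `skipped` mapping alongside the results is what produces a list like the one quoted below.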
I'll see if I can dig up original source material on (at least some of) the
files above and correct them. I'll use 3 and 4 as a guide.
Thanks
Matthew
On Wed, 14 Sept 2022 at 15:51, Saulius Gražulis <grazulis at ibt.lt> wrote:
> Dear Matthew!
>
> Thank you for reporting errors in the COD database, and sorry for not
> responding immediately. It took us some time to look into the issues.
>
> The problems are indeed serious, and I'll address how to deal with them
> in the letter below:
>
> On 2022-08-13 12:55, Matthew Rowles wrote:
>
> Hi COD people
>
> I was playing around with some ideas this week, and when parsing all the
> CIFs, I got the following errors from my program. They will all relate to
> errors in how the values of atom coordinates, atomic displacement
> parameters, or cell parameters are given.
> I've looked at most of them, and they're fairly obvious in their error. In
> all cases, this is just the first parsing error encountered; in most files
> there is more than one.
>
> What is the best way to go about fixing them?
>
> The problems that you list should be fixed whenever possible. Actually,
> there is an ongoing effort to use the CIF validator [1] to find problems of
> this kind and many others; the validator is run on the whole COD database,
> and issues are recorded in separate DB tables [2]. The problem is that
> there are over 11 million (!) validation messages; we are still in the
> process of triaging them and setting up the data curation queue.
>
> In this context your feedback is extremely helpful, since you tell us what
> is important for you as a COD user, and we can now set these issues to
> high priority!
>
> Fixing COD records.
> ----------------------------
>
> When fixing COD records in the process of data curation, we stick to one
> main principle: "do not invent data". This basically means that we (as a
> community ;) should only change the COD data record if there is positive
> evidence that the author(s) of the structure indeed meant what we are
> inserting into the COD CIF. So, for instance, we would only change value
> '.07.9' in the COD 2010104 entry to '0.079' if the original paper
> mentions that value somewhere in the text, or if there are derived data
> (say bond lengths) that can only be correct if the number in question is
> '0.079'.
> Otherwise we do not know if the number was '7.9' or '0.07(9)' or something
> like that. Likewise, in the entry 4065647, the number '0.11152)' could be
> actually '0.1115(2)' or '0.111(52)', and we do not know which without
> further investigation.
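
The "do not invent data" rule can be mirrored on the parsing side: accept only well-formed numbers and reject everything else, without guessing a repair. Below is a minimal sketch under stated assumptions. The function name `parse_cif_number` is hypothetical, and the pattern is a simplified stand-in for the CIF numeric syntax (decimal value, optional exponent, optional standard uncertainty in parentheses), not the full grammar.

```python
import re

# Simplified pattern (an assumption, not the complete CIF grammar):
# a decimal value, an optional exponent, and an optional standard
# uncertainty (s.u.) given as digits in parentheses, e.g. "0.1115(2)".
_NUMBER = re.compile(
    r"""
    (?P<value>[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)
    (?:\((?P<su>\d+)\))?
    """,
    re.VERBOSE,
)


def parse_cif_number(text):
    """Return (value, su_digits); raise ValueError on malformed input.

    su_digits holds the raw digits from the parentheses (None if absent);
    they are not rescaled to the units of the value.
    """
    match = _NUMBER.fullmatch(text)
    if match is None:
        # Deliberately no attempt to "repair" the string.
        raise ValueError(f"not a valid CIF number: {text!r}")
    su = match.group("su")
    return float(match.group("value")), (int(su) if su is not None else None)
```

With this sketch, '0.1115(2)' and '0.079' are accepted, while '.07.9' and '0.11152)' raise ValueError and are left for manual investigation.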
>
> All changes are documented within the corresponding COD CIF. The older
> data item was _cod_depositor_comments with a free-format text as its value
> with the description of the changes and issues; more recently we are
> switching to more structured comments in the _cod_entry_issue_... and
> _cod_changelog_entry_... loops.
>
> As an example, I can show two entries from your list which I have fixed
> recently [3,4]. I have inspected the supplementary file provided with the
> original publication and found a comment that said: "Starred atoms were
> refined isotropically" next to the atom coordinate loops. As there were
> indeed no U_ij data items for the "starred" atoms, I concluded that we can
> simply remove the '*' characters from the values to make them correct CIFs.
> The information about isotropic refinement is not lost since it can be
> inferred from the missing U_ij values, and is also documented in the COD
> Subversion revision history and in the file itself.
>
> The downside of this policy is that it takes time and we cannot promise
> fast fixes (it took me about an hour to investigate and fix these two
> entries [3,4]). The good thing is that the COD will never mislead the users
> and distort the original authors' claims if we strictly follow the above
> rules. Since COD is intended for scientific research we think that correct
> representation of the scientific data is the ultimate priority.
>
> So essentially there are the following two ways to deal with the issues
> you have found:
>
> a) you let us fix the issues at our (slow) pace; eventually, I hope, they
> will end up in the main COD repository;
>
> b) you fix the COD records using the above principles and send them to us;
> we check them and commit them to the COD repository (under your name, if you
> wish to get a credit ;). Please make sure that you have corroborating
> evidence for the changes in the comments section; you can either use the
> _cod_entry_issue_... and _cod_changelog_entry_... loops as in [3,4] (which
> is preferable) or, if that seems too complicated, you can use the older
> _cod_depositor_comments tag to describe your changes as in the older
> 7026981 example [5].
>
> If there is no evidence in the original publication as to what correct
> values should be inserted into the text, we *should not* fix them (unless
> we contact the author of the structure and he or she confirms the
> change). In that case, we can only mark the COD entry with the
> _cod_entry_issue_... data elements and set severity to 'warning' or
> 'error'. Your software should then skip such entries if it can not process
> them.
>
> For your computations, you can apply local fixes (e.g. in form of patches,
> as used by the Unix 'patch' command) immediately before processing; or you
> can ignore COD records with errors (which you probably do ;). In both cases
> your procedures and workflow will be properly documented, so I do not see
> any problems with scientific reproducibility if you use patched COD records
> on your side, provided the patch files are recorded in your workflow.
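
One way to record such a local fix is to store it as a unified diff, the format read by the Unix 'patch' command. The sketch below uses Python's standard `difflib` module. The helper name `make_patch` and the one-line excerpt are hypothetical and not taken from any real COD entry.

```python
import difflib


def make_patch(original_text, fixed_text, name):
    """Return a unified diff that turns original_text into fixed_text."""
    return "".join(
        difflib.unified_diff(
            original_text.splitlines(keepends=True),
            fixed_text.splitlines(keepends=True),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
        )
    )


# Hypothetical excerpt for illustration only; not copied from a COD file.
before = "C1 0.1234 0.5678 0.05*\n"
after = "C1 0.1234 0.5678 0.05\n"
patch_text = make_patch(before, after, "example.cif")
```

Saving `patch_text` next to the analysis scripts keeps every local change explicit and reviewable. A diff with these `a/` and `b/` prefixes would normally be applied with `patch -p1`.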
>
> I hope this clarifies the situation with buggy COD records. Unfortunately
> we still do have them – such was the state of the art in scientific data
> publishing when these data files were produced...
>
> Sincerely yours,
> Saulius
>
> Regards
>
> Matthew
>
> file msg
> cif\7026981.cif could not convert string to float: '6.5*'
> cif\2003944.cif could not convert string to float: 'O'
> cif\2002897.cif could not convert string to float: '-'
> cif\7027704.cif could not convert string to float: '0..049'
> cif\7217107.cif could not convert string to float: '-'
> cif\1516392.cif could not convert string to float: '0..0498'
> cif\2001164.cif could not convert string to float: '.093.'
> cif\2010104.cif could not convert string to float: '.07.9'
> cif\4115528.cif could not convert string to float: '0..076'
> cif\2001895.cif could not convert string to float: '.0.828'
> cif\2003945.cif could not convert string to float: 'O'
> cif\2007662.cif could not convert string to float: '0.0634*'
> cif\2000499.cif could not convert string to float: '.0479{6)'
> cif\4065647.cif could not convert string to float: '0.11152)'
> cif\4321917.cif could not convert string to float: '0.0)'
> cif\4112884.cif could not convert string to float: '0..0460'
> cif\2101447.cif could not convert string to float: '..114'
> cif\7012794.cif could not convert string to float: '.0.053'
> cif\4321125.cif could not convert string to float: ':0.0321'
> cif\7026982.cif could not convert string to float: '6*'
> cif\2006442.cif could not convert string to float: '0..0770'
> cif\2010442.cif could not convert string to float: '0.08*'
> cif\7009740.cif could not convert string to float: '0..0307'
> cif\2003869.cif could not convert string to float: '.0458^a^'
> cif\2003381.cif could not convert string to float: '6*'
> cif\7051472.cif could not convert string to float: '0..066'
> cif\2101014.cif could not convert string to float: '.o39'
> cif\2003699.cif could not convert string to float: '4.0*'
> cif\2206732.cif could not convert string to float: '0.029m6'
> cif\2001302.cif could not convert string to float: '0.047*'
> cif\2006197.cif could not convert string to float: '0..076'
> cif\4085045.cif could not convert string to float: '0..059'
> cif\2007409.cif could not convert string to float: '0..0535'
> cif\2009533.cif could not convert string to float: '/'
> cif\2007301.cif could not convert string to float: ''
> cif\4320747.cif could not convert string to float: 'H'
> cif\4320905.cif could not convert string to float: '0..0629'
> cif\2007048.cif could not convert string to float: '0.0393*'
> cif\4322860.cif could not convert string to float: '0..055'
> cif\2003192.cif could not convert string to float: '0.08*'
> cif\4322021.cif could not convert string to float: '0..073'
> cif\2006384.cif could not convert string to float: '0..131'
> cif\4315383.cif could not convert string to float: '0.0.99'
> cif\4300081.cif could not convert string to float: '0.l09'
> cif\7027965.cif could not convert string to float: '0..037'
> cif\4061854.cif could not convert string to float: '0._61'
> cif\4300051.cif could not convert string to float: '0..0565'
> cif\2005964.cif could not convert string to float: '4.4*'
> cif\2009883.cif could not convert string to float: '.046.5'
> cif\2007654.cif could not convert string to float: '0.0494*'
> cif\4323741.cif could not convert string to float: '0..0283'
> cif\4322390.cif could not convert string to float: '0..115'
> cif\7111128.cif could not convert string to float: '0..091'
> cif\2005961.cif could not convert string to float: "C1'"
> cif\4333084.cif could not convert string to float: '0.09817)'
> cif\2009384.cif could not convert string to float: 'H91'
> cif\4320033.cif could not convert string to float: '..0794'
> cif\7027705.cif could not convert string to float: '0..056'
> cif\2002898.cif could not convert string to float: '-'
> cif\4322414.cif could not convert string to float: '0..0266'
> cif\3500007.cif could not convert string to float: '0.O1D'
> cif\2007253.cif could not convert string to float: '0..0684'
> cif\2005907.cif could not convert string to float: '0.0962*'
> cif\4115907.cif could not convert string to float: '0..0486'
> cif\4322396.cif could not convert string to float: '0..042'
> cif\7009739.cif could not convert string to float: '0..0477'
> cif\2101312.cif could not convert string to float: '6.0*'
> cif\7702612.cif could not convert string to float: '16*'
>
> References:
>
> [1] Vaitkus, A.; Merkys, A. & Gražulis, S. Validation of the
> Crystallography Open Database using the Crystallographic Information
> Framework. Journal of Applied Crystallography, International Union of
> Crystallography (IUCr), 2021, 54, 1-12 DOI:
> https://doi.org/10.1107/s1600576720016532
>
> [2] Vaitkus, A. COD validation issue database. 2021, URL:
> http://sql.crystallography.net/db/cod_validation [accessed
> 2022-09-14T10:07+03:00]. NOTE: the page is slow to load, please be patient!
>
> [3] http://www.crystallography.net/cod/7026981.html [accessed
> 2022-09-14T10:18+03:00]
>
> [4] http://www.crystallography.net/cod/7026982.html [accessed
> 2022-09-14T10:18+03:00]
>
> [5] Older revision of the COD 7026981 entry, showing the use of the
> _cod_depositor_comments data item. URL:
> http://www.crystallography.net/cod/7026981.cif@277790 [accessed
> 2022-09-14T10:35+03:00]
>
> --
> Dr. Saulius Gražulis
> Vilnius University Institute of Biotechnology, Saulėtekio al. 7
> LT-10257 Vilnius, Lietuva (Lithuania)
> fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
> mobile: (+370-684)-49802, (+370-614)-36366
>
>