[Cod-bugs] Bugs found in COD: would you like me to send you id's?

Harry Dudley-Bestow harry.dudleybestow at gmail.com
Mon Jul 24 18:02:01 EEST 2023


Thank you for your extensive response Saulius! As you may have guessed I do
not know very much about crystallography, and am just now learning what a
"disorder group" is. Thank you for taking the time to explain to me what is
going on, and also about how I can query for molecules with known
irregularities.

If in the future I manage to spot some issues and am sufficiently
certain about their nature that I can generate an official error
code, I will most certainly send them over. I've been reading a number of
papers in the machine learning literature in which computer scientists
attempt to predict various chemical properties using databases like the
COD. They never seem to mention the presence of disorder or symmetry groups
and reading through the code I am getting a sneaking suspicion that many of
them are not even aware that they exist!

The policy of never removing COD entries makes sense: one
wouldn't want to remove an entry entirely and open up the possibility of
the same ID later being re-assigned to a different molecule.

Regards,
Harry


On Mon, Jul 24, 2023 at 4:20 AM Saulius Gražulis <grazulis at ibt.lt> wrote:

> Dear Harry,
>
> thank you for your e-mail!
>
> On 2023-07-22 20:07, Harry Dudley-Bestow wrote:
> > I have been dabbling about in the COD database doing various
> > explorations for some machine learning stuff I have been doing as a
> > hobby project. In the process I have come across quite a few examples
> > of what (to me) look like obviously incorrect crystal structures. Here
> > is one example:
> > https://www.crystallography.net/cod/7237139.html
> > Where the carbon-carbon distance is wayyy smaller than you would
> > expect for a regular compound.
>
> In this case, the carbon atom is disordered, and is modelled as two
> sites belonging to two different disorder groups [1]. The groups are
> instances of the same disorder assembly [2]. Thus the structure is
> perfectly correct from this point of view, and the disorder is marked up
> correctly. Your software should be prepared to deal with such situations
> in one way or another.
>
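
One way software might deal with marked-up disorder is to keep, for each
disorder assembly, only the disorder group with the largest summed
occupancy. Below is a minimal Python sketch under stated assumptions: the
atom records are hypothetical dictionaries already parsed from the
_atom_site_ loop (the keys "assembly", "group" and "occupancy" are
illustrative names, not any library's API), and a group value of '.' marks
an atom that is not disordered. Unknown ('?') values and negative group
numbers are not handled here.

```python
from collections import defaultdict


def keep_major_disorder_groups(atoms):
    """Keep ordered atoms plus, per assembly, the group with the highest
    summed occupancy. Ties go to the group encountered first."""
    # Sum occupancies for every (assembly, group) pair of disordered atoms.
    totals = defaultdict(float)
    for atom in atoms:
        if atom["group"] != ".":
            totals[(atom["assembly"], atom["group"])] += atom["occupancy"]

    # For each assembly, remember the group with the largest total.
    best = {}
    for (assembly, group), total in totals.items():
        if assembly not in best or total > best[assembly][1]:
            best[assembly] = (group, total)

    return [
        atom for atom in atoms
        if atom["group"] == "." or best[atom["assembly"]][0] == atom["group"]
    ]


# Hypothetical records: one ordered carbon and one carbon split over two sites.
atoms = [
    {"label": "C1", "assembly": ".", "group": ".", "occupancy": 1.0},
    {"label": "C2A", "assembly": "A", "group": "1", "occupancy": 0.6},
    {"label": "C2B", "assembly": "A", "group": "2", "occupancy": 0.4},
]
kept = keep_major_disorder_groups(atoms)
print([atom["label"] for atom in kept])  # ['C1', 'C2A']
```

Dropping the minor component is only one possible policy; for some tasks it
may be more appropriate to skip disordered entries altogether.
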
> If you would like to have files where disorder is resolved (so that they
> are closer to a chemist's perception of the molecule), you may want to
> look into our generated molecular files [3], or the same files uploaded
> to PubChem [4].
>
> > Would you like me to send tables of the offending compounds so...
>
> The structures with marked-up disorder are known and catalogued; you
> can quickly get them using an SQL query:
>
> mysql -u cod_reader -h sql.crystallography.net cod -e 'select file from
> data where flags like "%has disorder%"'
>
> There are of course entries where disorder is not marked up correctly, or
> where other problems exist. We have run different CIF validation
> software on the COD collection, and there are about 11 million validation
> messages that we get from published structures (!). This is a daunting
> number, and clearly needs a special way to deal with problems, with
> a great deal of ad-hoc automation. We are slowly grinding through this
> issue list :)
>
> The outputs (logs) of your processing software would be, no doubt, very
> useful to look into additional validation issues, and to see what the
> COD users expect from the data files we present. We have such
> contributions from multiple users by now, and some lead to correction of
> COD entries. We cannot, unfortunately, promise that we will act on them
> immediately, due to the limited manpower that we possess. The most valuable
> logs are the ones that:
>
> - contain error messages compatible with the IUCr standards (the example
> above, as mentioned, is not an error in the COD);
>
> - allow one to identify the COD record and the nature of the problem
> immediately, without additional computations;
>
> - contain suggestions on how to rectify the situation, preferably
> automatically; of course we can only apply corrections that are
> compatible with the IUCr standards and COD data policies.
>
> > ... that they can be removed? I don't know what COD's policy is with
> > regards to removing suspect data.
>
> The COD never removes any record and its ID, even if a structure is
> retracted (e.g. when scientific fraud is detected). Instead, the
> structure is marked using data items in the '_cod_entry_issue_[]'
> category (please have a look at the COD CIF dictionary [5]). For
> retracted records, we replace coordinates by dots ('.' values). In less
> severe cases, we try to rectify the entry if possible (without changing
> the original authors' interpretation), and the record gets a new
> revision. The changes are marked with the '_cod_entry_issue_[]' category
> items or (in the older entries) using the _cod_depositor_comments text field.
>
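
Because retracted records keep their ID but have their coordinates replaced
by '.' values, a processing script can skip them with a simple check. Below
is a minimal Python sketch, assuming hypothetical atom records that hold
the raw x, y, z strings as read from the CIF (the key names are
illustrative). A more thorough check would also look at the
'_cod_entry_issue_[]' items, whose exact names are not assumed here.

```python
def has_placeholder_coordinates(atoms):
    """Return True if there is at least one atom and every atom has '.'
    for all three fractional coordinates."""
    return bool(atoms) and all(
        atom[axis] == "." for atom in atoms for axis in ("x", "y", "z")
    )


# Hypothetical records: a retracted entry and a normal one.
retracted = [{"label": "C1", "x": ".", "y": ".", "z": "."}]
normal = [{"label": "C1", "x": "0.1234", "y": "0.5000", "z": "0.2500"}]

print(has_placeholder_coordinates(retracted))  # True
print(has_placeholder_coordinates(normal))     # False
```
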
> Regards,
> Saulius
>
> Refs.:
>
> [1] https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_disorder_group.html
>
> [2] https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_disorder_assembly.html
>
> [3] http://molecules.crystallography.net/~saulius/cod-molecules/cgi-bin/run.pl?command=cod-molecule-display&codid=7237139
>
> [4] https://pubchem.ncbi.nlm.nih.gov/substance/482279549
>
> [5] http://www.crystallography.net/cod/cif/dictionaries/cif_cod.dic
>
> --
> Dr. Saulius Gražulis
> Vilnius University, Life Science Center, Institute of Biotechnology
> Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
> phone (office): (+370-5)-2234353, mobile: (+370-684)-49802,
> (+370-614)-36366