[Cod-bugs] Bugs found in COD: would you like me to send you id's?

Saulius Gražulis grazulis at ibt.lt
Mon Jul 24 14:15:06 EEST 2023


Dear Harry,

thank you for your e-mail!

On 2023-07-22 20:07, Harry Dudley-Bestow wrote:
> I have been dabbling about in the COD database doing various 
> explorations for some machine learning stuff I have been doing as a 
> hobby project. In the process I have come across quite a few examples 
> of what (to me) look like obviously incorrect crystal structures. Here 
> is one example:
> https://www.crystallography.net/cod/7237139.html
> Where the carbon-carbon distance is wayyy smaller than you would 
> expect for a regular compound.

In this case, the carbon atom is disordered, and is modelled as two 
sites belonging to two different disorder groups [1]. The groups are 
instances of the same disorder assembly [2]. Thus the structure is 
perfectly correct from this point of view, and the disorder is marked up 
correctly. Your software should be prepared to deal with such situations 
in one way or another.
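
As a rough illustration of one such way (a minimal sketch, not code from 
the COD tools): keep all ordered sites and, for each disorder assembly, 
only the disorder group with the largest summed occupancy. The function 
name and the dictionary keys are hypothetical; the sketch assumes the 
atom sites have already been read from the CIF, with '.' or '?' meaning 
that no disorder group is assigned.

```python
from collections import defaultdict

NO_GROUP = (".", "?")  # CIF values for "not applicable" / "unknown"


def select_major_disorder_groups(atoms):
    """Keep ordered atoms plus one disorder group per assembly.

    `atoms` is a list of dicts with the (hypothetical) keys 'label',
    'disorder_assembly', 'disorder_group' and 'occupancy'.
    """
    # Sum occupancies per (assembly, group).
    totals = defaultdict(lambda: defaultdict(float))
    for atom in atoms:
        if atom["disorder_group"] in NO_GROUP:
            continue
        totals[atom["disorder_assembly"]][atom["disorder_group"]] += atom["occupancy"]

    # For each assembly, pick the group with the highest summed occupancy.
    chosen = {
        assembly: max(groups, key=groups.get)
        for assembly, groups in totals.items()
    }

    return [
        atom
        for atom in atoms
        if atom["disorder_group"] in NO_GROUP
        or chosen[atom["disorder_assembly"]] == atom["disorder_group"]
    ]


# Made-up example: one ordered carbon and one carbon split over two sites.
example_atoms = [
    {"label": "C1", "disorder_assembly": ".", "disorder_group": ".", "occupancy": 1.0},
    {"label": "C2A", "disorder_assembly": "A", "disorder_group": "1", "occupancy": 0.6},
    {"label": "C2B", "disorder_assembly": "A", "disorder_group": "2", "occupancy": 0.4},
]
kept_labels = [a["label"] for a in select_major_disorder_groups(example_atoms)]
```

Distances would then be computed only among the kept sites, so that 
alternative positions of the same atom are never compared with each 
other.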

If you would like to have files where disorder is resolved (so that they 
are closer to a chemist's perception of the molecule), you may want to 
look into our generated molecular files [3], or the same files uploaded 
to PubChem [4].

> Would you like me to send tables of the offending compounds so...

The structures with marked-up disorder are known and catalogued; you 
can quickly get them using an SQL query:

mysql -u cod_reader -h sql.crystallography.net cod -e 'select file from 
data where flags like "%has disorder%"'
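
If it helps, the output of such a query can be turned into links to the 
COD entry pages. The following is only a sketch: it assumes that the 
'file' column holds the numeric COD ID and that the query output has one 
value per line with a header line; the function name is hypothetical. 
The URL pattern is the one used for the entry quoted above.

```python
COD_ENTRY_URL = "https://www.crystallography.net/cod/{cod_id}.html"


def cod_urls_from_query_output(output):
    """Convert one-value-per-line query output into COD entry URLs.

    Lines that are not purely numeric (such as the header line) are
    skipped.
    """
    urls = []
    for line in output.splitlines():
        value = line.strip()
        if value.isdigit():
            urls.append(COD_ENTRY_URL.format(cod_id=value))
    return urls


# Example with text as it might be captured from the mysql client.
sample_output = "file\n7237139\n"
sample_urls = cod_urls_from_query_output(sample_output)
```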

There are of course entries where disorder is not marked up correctly, or 
where other problems exist. We have run different CIF validation 
software on the COD collection, and we get about 11 million validation 
messages from published structures (!). This is a daunting number, and 
it clearly needs a special way of dealing with the problems, with a 
great deal of ad-hoc automation. We are slowly grinding through this 
issue list :)

The outputs (logs) of your processing software would, no doubt, be very 
useful for looking into additional validation issues, and for seeing 
what COD users expect from the data files we present. We have such 
contributions from multiple users by now, and some have led to 
corrections of COD entries. Unfortunately, we cannot promise that we 
will act on them immediately, due to the limited manpower that we 
possess. The most valuable logs are the ones that:

- contain error messages compatible with the IUCr standards (the example 
above, as mentioned, is not an error in the COD);

- allow one to identify the COD record and the nature of the problem 
immediately, without additional computations;

- contain suggestions on how to rectify the situation, preferably 
automatically; of course we can only apply corrections that are 
compatible with the IUCr standards and COD data policies.
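
To illustrate the points above, a log line could carry the COD ID, the 
affected data item, the message and an optional suggestion in fixed 
positions. This is a hypothetical layout, not a format that the COD 
prescribes; the function name, the field order and the example values 
are assumptions made for the sketch.

```python
def format_issue(cod_id, data_item, message, suggestion=""):
    """Build one tab-separated log line describing a single problem.

    Tabs and newlines inside the fields are replaced by spaces so that
    every record stays on one line and can be split unambiguously.
    """
    fields = [str(cod_id), data_item, message, suggestion]
    return "\t".join(
        field.replace("\t", " ").replace("\n", " ") for field in fields
    )


# Placeholder ID and message, for illustration only.
example_line = format_issue(
    1234567,
    "_cell_length_a",
    "value missing",
    "take the value from the original publication",
)
```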

> ... that they can be removed? I don't know what COD's policy is with 
> regards to removing suspect data.

The COD never removes any record and its ID, even if a structure is 
retracted (e.g. when scientific fraud is detected). Instead, the 
structure is marked using data items in the '_cod_entry_issue_[]' 
category (please have a look at the COD CIF dictionary [5]). For 
retracted records, we replace coordinates by dots ('.' values). In less 
severe cases, we try to rectify the entry if possible (without changing 
the original authors' interpretation), and the record gets a new 
revision. The changes are marked with the '_cod_entry_issue_[]' category 
items or (in the older entries) using the '_cod_depositor_comments' text 
field.
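
A program that consumes COD files could check for such marks before 
using an entry. The sketch below is a simple line scan, not a full CIF 
parser: it lists the data names that start with '_cod_entry_issue_' and 
skips the contents of semicolon-delimited text fields. The function name 
is hypothetical, and the item names in the example fragment are for 
illustration only; the authoritative list is in the dictionary [5].

```python
ISSUE_PREFIX = "_cod_entry_issue_"


def find_entry_issue_items(cif_text):
    """Return the sorted, lower-cased issue data names found in a CIF text."""
    found = set()
    in_text_field = False
    for line in cif_text.splitlines():
        # A semicolon in the first column opens or closes a text field.
        if line.startswith(";"):
            in_text_field = not in_text_field
            continue
        if in_text_field:
            continue
        tokens = line.split()
        # CIF data names are case-insensitive, so compare in lower case.
        if tokens and tokens[0].lower().startswith(ISSUE_PREFIX):
            found.add(tokens[0].lower())
    return sorted(found)


# Made-up fragment; the issue item names are illustrative.
example_cif = """data_example
loop_
_cod_entry_issue_severity
_cod_entry_issue_description
warning 'example description'
_cod_depositor_comments
;
_cod_entry_issue_ mentioned inside a comment is not a data item
;
"""
issue_items = find_entry_issue_items(example_cif)
```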

Regards,
Saulius

Refs.:

[1] https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_disorder_group.html

[2] https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_disorder_assembly.html

[3] http://molecules.crystallography.net/~saulius/cod-molecules/cgi-bin/run.pl?command=cod-molecule-display&codid=7237139

[4] https://pubchem.ncbi.nlm.nih.gov/substance/482279549

[5] http://www.crystallography.net/cod/cif/dictionaries/cif_cod.dic

-- 
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366

