[Cod-bugs] Bugs found in COD: would you like me to send you id's?
Saulius Gražulis
grazulis at ibt.lt
Mon Jul 24 14:15:06 EEST 2023
Dear Harry,
thank you for your e-mail!
On 2023-07-22 20:07, Harry Dudley-Bestow wrote:
> I have been dabbling about in the COD database doing various
> explorations for some machine learning stuff I have been doing as a
> hobby project. In the process I have come across quite a few examples
> of what (to me) look like obviously incorrect crystal structures. Here
> is one example:
> https://www.crystallography.net/cod/7237139.html
> Where the carbon-carbon distance is wayyy smaller than you would
> expect for a regular compound.
In this case, the carbon atom is disordered, and is modelled as two
sites belonging to two different disorder groups [1]. The groups are
instances of the same disorder assembly [2]. Thus the structure is
perfectly correct from this point of view, and the disorder is marked up
correctly. Your software should be prepared to deal with such situations
in one way or another.
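
As a rough illustration of one such way, the sketch below skips atom pairs
that are alternative positions of the same disordered atom before running a
short-contact check. It assumes the sites have already been read from the CIF
and converted to Cartesian coordinates; the Site class, the function names,
the cutoff and the coordinates are all made up for the example (they are not
taken from entry 7237139), and special cases such as negative disorder group
numbers are not handled.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist

UNSET = {".", "?"}  # CIF placeholders for "inapplicable" and "unknown"


@dataclass(frozen=True)
class Site:
    """Hypothetical container for one atom site (not a COD or IUCr API)."""
    label: str
    xyz: tuple            # Cartesian coordinates in angstroms
    assembly: str = "."   # value of _atom_site_disorder_assembly
    group: str = "."      # value of _atom_site_disorder_group


def are_alternatives(a, b):
    """True if two sites are different disorder groups of one assembly."""
    if a.group in UNSET or b.group in UNSET:
        return False      # an ordered atom coexists with every other atom
    return a.assembly == b.assembly and a.group != b.group


def short_contacts(sites, cutoff=1.0):
    """Return (label, label, distance) for suspiciously close site pairs."""
    hits = []
    for a, b in combinations(sites, 2):
        if are_alternatives(a, b):
            continue      # never present together, so not a real contact
        d = dist(a.xyz, b.xyz)
        if d < cutoff:
            hits.append((a.label, b.label, round(d, 3)))
    return hits


# Invented example: C2A and C2B are two positions of one disordered carbon.
sites = [
    Site("C1", (0.0, 0.0, 0.0)),
    Site("C2A", (1.5, 0.0, 0.0), assembly="A", group="1"),
    Site("C2B", (1.5, 0.4, 0.0), assembly="A", group="2"),
]
print(short_contacts(sites))
```

Without the are_alternatives() test, the C2A/C2B pair would be reported as an
impossibly short carbon-carbon distance.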
If you would like to have files where disorder is resolved (so that they
are closer to chemist's perception of the molecule), you may want to
look into our generated molecular files [3], or the same files uploaded
to PubChem [4].
> Would you like me to send tables of the offending compounds so...
The structures with marked-up disorder are known and catalogued; you
can quickly get them using an SQL query:
mysql -u cod_reader -h sql.crystallography.net cod -e 'select file from
data where flags like "%has disorder%"'
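
The same query can be issued from a script. The sketch below is only an
illustration: it runs the query through Python's DB-API against an in-memory
SQLite table that stands in for the COD 'data' table, so that it can be tried
without network access. The helper name and the toy rows are made up. Against
the real server you would open the connection with a MySQL driver of your
choice, and most such drivers expect '%s' rather than '?' as the parameter
placeholder.

```python
import sqlite3

# '?' is the SQLite placeholder; MySQL drivers typically use '%s' instead.
QUERY = "SELECT file FROM data WHERE flags LIKE ?"


def files_with_flag(conn, flag):
    """Return the 'file' values of all rows whose flags contain `flag`."""
    cur = conn.cursor()
    cur.execute(QUERY, (f"%{flag}%",))
    return [row[0] for row in cur.fetchall()]


# Stand-in for the COD 'data' table, filled with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (file TEXT, flags TEXT)")
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [
        ("0000001", "has coordinates"),
        ("0000002", "has coordinates,has disorder"),
    ],
)
print(files_with_flag(conn, "has disorder"))  # prints ['0000002']
```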
There are of course entries where disorder is not marked up correctly, or
where other problems exist. We have run different CIF validation
software on the COD collection, and there are about 11 million validation
messages that we get from published structures (!). This is a daunting
number, and it clearly needs a special way of dealing with the problems,
with a great deal of ad-hoc automation. We are slowly grinding through this
issue list :)
The outputs (logs) of your processing software would, no doubt, be very
useful for looking into additional validation issues, and for seeing what
the COD users expect from the data files we present. We have such
contributions from multiple users by now, and some have led to corrections
of COD entries. Unfortunately, we cannot promise that we will act on them
immediately, due to the limited manpower that we possess. The most valuable
logs are the ones that:
- contain error messages compatible with the IUCr standards (the example
above, as mentioned, is not an error in the COD);
- allow us to identify the COD record and the nature of the problem
immediately, without additional computations;
- contain suggestions on how to rectify the situation, preferably
automatically; of course we can only apply corrections that are
compatible with the IUCr standards and COD data policies.
> ... that they can be removed? I don't know what COD's policy is with
> regards to removing suspect data.
The COD never removes any record and its ID, even if a structure is
retracted (e.g. when scientific fraud is detected). Instead, the
structure is marked using data items in the '_cod_entry_issue_[]'
category (please have a look at the COD CIF dictionary [5]). For
retracted records, we replace coordinates by dots ('.' values). In less
severe cases, we try to rectify the entry if possible (without changing
the original authors' interpretation), and the record gets a new
revision. The changes are marked with the '_cod_entry_issue_[]' category
items or (in the older entries) using _cod_depositor_comments text field.
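
To show how a script might notice these markers, here is a minimal sketch
that scans the text of a CIF for data names starting with
'_cod_entry_issue_', for '_cod_depositor_comments', and for atom site loops
whose fractional x coordinates are all '.'. It is a naive line-based scan,
not a CIF parser (it ignores multi-line text fields and quoted values with
spaces in loops); the function name and the sample entry are invented, and
'_cod_entry_issue_placeholder' is a made-up item name used only to exercise
the prefix match. The real item names are defined in the COD CIF
dictionary [5].

```python
def scan_cod_entry(cif_text):
    """Naively collect issue markers from the text of one CIF entry."""
    issue_tags = []
    depositor_comments = False
    columns = []        # data names of the loop currently being read
    in_header = False   # True while reading the data names after 'loop_'
    x_values = []       # values found in the _atom_site_fract_x column
    for raw in cif_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        first = line.split()[0].lower()
        if first == "loop_":
            columns, in_header = [], True
        elif first.startswith("_"):
            if in_header:
                columns.append(first)
            else:
                columns = []  # a plain data item ends any previous loop
            if first.startswith("_cod_entry_issue_"):
                issue_tags.append(first)
            elif first == "_cod_depositor_comments":
                depositor_comments = True
        else:
            in_header = False
            if "_atom_site_fract_x" in columns:
                values = line.split()
                if len(values) == len(columns):
                    x_values.append(values[columns.index("_atom_site_fract_x")])
    return {
        "issue_tags": issue_tags,
        "depositor_comments": depositor_comments,
        "coordinates_withheld": bool(x_values) and all(v == "." for v in x_values),
    }


# Invented entry; the issue item name below is a placeholder, not a real one.
SAMPLE = """\
data_example
_cod_entry_issue_placeholder 'made-up value'
loop_
_atom_site_label
_atom_site_fract_x
_atom_site_fract_y
_atom_site_fract_z
C1 . . .
C2 . . .
"""
print(scan_cod_entry(SAMPLE))
```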
Regards,
Saulius
Refs.:
[1]
https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_disorder_group.html
[2]
https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_disorder_assembly.html
[3]
http://molecules.crystallography.net/~saulius/cod-molecules/cgi-bin/run.pl?command=cod-molecule-display&codid=7237139
[4] https://pubchem.ncbi.nlm.nih.gov/substance/482279549
[5] http://www.crystallography.net/cod/cif/dictionaries/cif_cod.dic
--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366