From grazulis at ibt.lt Mon Jul 24 14:15:06 2023
From: grazulis at ibt.lt (Saulius Gražulis)
Date: Mon, 24 Jul 2023 14:15:06 +0300
Subject: [Cod-bugs] Bugs found in COD: would you like me to send you id's?
In-Reply-To:
References:
Message-ID: <6fd099de-ba70-af9d-f546-32f87d79cda2@ibt.lt>

Dear Harry,

thank you for your e-mail!

On 2023-07-22 20:07, Harry Dudley-Bestow wrote:
> I have been dabbling about in the COD database doing various
> explorations for some machine learning stuff I have been doing as a
> hobby project. In the process I have come across quite a few examples
> of what (to me) look like obviously incorrect crystal structures. Here
> is one example:
> https://www.crystallography.net/cod/7237139.html
> Where the carbon-carbon distance is wayyy smaller than you would
> expect for a regular compound.

In this case, the carbon atom is disordered, and is modelled as two
sites belonging to two different disorder groups [1]. The groups are
instances of the same disorder assembly [2]. Thus the structure is
perfectly correct from this point of view, and the disorder is marked up
correctly. Your software should be prepared to deal with such situations
in one way or another.

If you would like to have files where disorder is resolved (so that they
are closer to a chemist's perception of the molecule), you may want to
look into our generated molecular files [3], or the same files uploaded
to PubChem [4].

> Would you like me to send tables of the offending compounds so...

The structures with marked-up disorder are known and catalogued; you
can quickly get them using an SQL query:

    mysql -u cod_reader -h sql.crystallography.net cod -e 'select file from data where flags like "%has disorder%"'

There are of course entries where disorder is not marked up correctly, or
where other problems exist. We have run different CIF validation
software on the COD collection, and there are about 11 million validation
messages that we get from published structures (!).
This is a daunting number, and it clearly calls for a special way of
dealing with the problems, with a great deal of ad-hoc automation. We
are slowly grinding through this issue list :)

The outputs (logs) of your processing software would, no doubt, be very
useful for looking into additional validation issues, and for seeing
what COD users expect from the data files we present. We have such
contributions from multiple users by now, and some have led to
corrections of COD entries. We cannot, unfortunately, promise that we
will act on them immediately, due to our limited manpower. The most
valuable logs are the ones that:

- contain error messages compatible with the IUCr standards (the example
above, as mentioned, is not an error in the COD);

- allow us to identify the COD record and the nature of the problem
immediately, without additional computations;

- contain suggestions on how to rectify the situation, preferably
automatically; of course we can only apply corrections that are
compatible with the IUCr standards and COD data policies.

> ... that they can be removed? I don't know what COD's policy is with
> regards to removing suspect data.

The COD never removes any record or its ID, even if a structure is
retracted (e.g. when scientific fraud is detected). Instead, the
structure is marked using data items in the '_cod_entry_issue_[]'
category (please have a look at the COD CIF dictionary [5]). For
retracted records, we replace coordinates by dots ('.' values). In less
severe cases, we try to rectify the entry if possible (without changing
the original authors' interpretation), and the record gets a new
revision. The changes are marked with the '_cod_entry_issue_[]' category
items or (in the older entries) using the _cod_depositor_comments text
field.
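
As an illustration of "dealing with such situations in one way or
another", here is a minimal Python sketch that keeps all undisordered
atoms plus one disorder group per disorder assembly. It is not the
method used for the COD molecular files. The dict keys, the function
name resolve_disorder and the three sample atoms are assumptions made
up for this example; only the ideas of disorder group, disorder
assembly and occupancy come from the CIF core dictionary. Picking the
group with the highest summed occupancy is one possible policy among
several.

```python
from collections import defaultdict

# CIF values meaning "inapplicable" and "unknown".
UNSET = (".", "?")


def resolve_disorder(atoms):
    """Return undisordered atoms plus one disorder group per assembly.

    `atoms` is a list of dicts with the hypothetical keys "label",
    "assembly", "group" (strings) and "occupancy" (float), as they
    might be filled from the corresponding _atom_site_* data items.
    """
    # Sum the occupancies of each (assembly, group) pair.
    totals = defaultdict(float)
    for atom in atoms:
        if atom["group"] not in UNSET:
            totals[(atom["assembly"], atom["group"])] += atom["occupancy"]

    # For each assembly keep the group with the highest summed occupancy;
    # on a tie the group seen first wins.
    chosen = {}
    for (assembly, group), occupancy in totals.items():
        current = chosen.get(assembly)
        if current is None or occupancy > totals[(assembly, current)]:
            chosen[assembly] = group

    return [
        atom for atom in atoms
        if atom["group"] in UNSET or chosen[atom["assembly"]] == atom["group"]
    ]


# Made-up sample: one ordered atom and one atom split over two sites.
atoms = [
    {"label": "C1", "assembly": ".", "group": ".", "occupancy": 1.0},
    {"label": "C2A", "assembly": "A", "group": "1", "occupancy": 0.6},
    {"label": "C2B", "assembly": "A", "group": "2", "occupancy": 0.4},
]
print([a["label"] for a in resolve_disorder(atoms)])  # ['C1', 'C2A']
```

With the alternative site C2B dropped, a distance check no longer sees
the two overlapping carbon positions as a bond.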

Regards,
Saulius

Refs.:

[1] https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_disorder_group.html
[2] https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_disorder_assembly.html
[3] http://molecules.crystallography.net/~saulius/cod-molecules/cgi-bin/run.pl?command=cod-molecule-display&codid=7237139
[4] https://pubchem.ncbi.nlm.nih.gov/substance/482279549
[5] http://www.crystallography.net/cod/cif/dictionaries/cif_cod.dic

--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366

From harry.dudleybestow at gmail.com Mon Jul 24 18:02:01 2023
From: harry.dudleybestow at gmail.com (Harry Dudley-Bestow)
Date: Mon, 24 Jul 2023 08:02:01 -0700
Subject: [Cod-bugs] Bugs found in COD: would you like me to send you id's?
In-Reply-To: <6fd099de-ba70-af9d-f546-32f87d79cda2@ibt.lt>
References: <6fd099de-ba70-af9d-f546-32f87d79cda2@ibt.lt>
Message-ID:

Thank you for your extensive response, Saulius!

As you may have guessed, I do not know very much about crystallography,
and am just now learning what a "disorder group" is. Thank you for
taking the time to explain to me what is going on, and also how I can
query for molecules with known irregularities. If in the future I manage
to spot some issues and am sufficiently certain about the nature of the
issue that I can generate an official error code, I will most certainly
send them over.

I've been reading a number of papers in the machine learning literature
in which computer scientists attempt to predict various chemical
properties using databases like the COD.
They never seem to mention the presence of disorder or symmetry groups,
and reading through the code I am getting a sneaking suspicion that many
of them are not even aware that they exist!

The policy regarding the removal of COD entries makes sense: one
wouldn't want to remove an entry entirely and open the possibility of
re-assigning a different molecule to the same id.

Regards,
Harry
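
For reports of the kind discussed in this thread, a log line could be
laid out so that it names the COD record, states the problem without
further computation, and carries a suggested fix, which are the three
properties asked for above. The Python sketch below is only one
possible layout. The Issue class, its field names, the tab-separated
format and the placeholder ID 0000000 are all assumptions for
illustration, not a COD or IUCr format; the entry URL pattern is the
one used for 7237139 earlier in the thread.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    cod_id: str      # COD record number (placeholder value in the example)
    data_name: str   # CIF data name the problem concerns
    message: str     # what is wrong, readable without recomputation
    suggestion: str  # proposed correction, or "" if there is none


def format_issue(issue):
    """Render one issue as a single tab-separated log line."""
    url = f"https://www.crystallography.net/cod/{issue.cod_id}.html"
    fields = [
        issue.cod_id,
        url,
        issue.data_name,
        issue.message,
        issue.suggestion or "-",  # "-" marks "no suggestion"
    ]
    return "\t".join(fields)


# Hypothetical example; 0000000 is not a real report about a COD entry.
example = Issue(
    cod_id="0000000",
    data_name="_atom_site_occupancy",
    message="occupancy greater than 1",
    suggestion="",
)
print(format_issue(example))
```

One line per issue keeps the file easy to sort and filter by COD ID
with standard text tools.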