From grazulis at ibt.lt Tue Nov 28 09:57:54 2023
From: grazulis at ibt.lt (Saulius Gražulis)
Date: Tue, 28 Nov 2023 09:57:54 +0200
Subject: [Cod-bugs] Some structures published in Acta Cryst. E are missing in COD
In-Reply-To:
References:
Message-ID: <9e8df34d-d276-4fdb-a572-646f7d1e9758@ibt.lt>

Dear Koichi,

thank you very much for your e-mail and for the valuable bug report!

On 2023-11-25 03:28, Koichi Kitahara wrote:
>
> I found that some structures published in Acta Cryst. E are missing in COD.
> Are there any criteria for structures to be available in COD,

The only exclusion criterion for structures published in reputable
peer-reviewed publications is that, as a rule, we do not want
theoretical structures in the COD; such structures should go into the
TCOD [2]. Even then, theoretical structures sometimes end up in the COD
if they are not recognised as theoretical, and are only sorted out later.

So any missing structure can be considered a bug in the COD.

> ... or are
> there some bugs in automatic extraction routines of COD?

You are right, there was a transient problem in the deposition pipeline
(connection to port 80 timed out). Such timeouts happen when the system
is overloaded with external connections; we are working to find a
solution for this problem.

Meanwhile, I have re-run the deposition scripts manually and the
structures are now in the COD :) [3]

> In particular, I'm wondering if our structures [1] will be available in
> COD in near future or not.

We are trying to get structures into the COD as soon as possible. I am
sorry for this delay in the publication of the structures; this was a
glitch which we overlooked.

Meanwhile, we would be grateful if you report any other such cases of
missing published structures.
While we do have the deposition logs for these failures, we lack the
manpower to address these issues quickly; thus the reports from the COD
users are very helpful :)

Sincerely yours,
Saulius

Refs.:

> [1] https://doi.org/10.1107/S2056989023008393

[2] The Theoretical Crystallography Open Database (TCOD).
https://www.crystallography.net/tcod/

[3] Kitahara, Koichi; Takakura, Hiroyuki; Iwasaki, Yutaka; Kimura,
Kaoru. Crystal structures of five compounds in the
aluminium-ruthenium-silicon system. Acta crystallographica Section E,
Crystallographic communications (2023) 79(10), 946--951.

http://www.crystallography.net/cod/2312358.html
http://www.crystallography.net/cod/2312359.html
http://www.crystallography.net/cod/2312360.html
http://www.crystallography.net/cod/2312361.html
http://www.crystallography.net/cod/2312362.html
[accessed 2023-11-28T09:50+02:00].

--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802,
(+370-614)-36366

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

From grazulis at ibt.lt Tue Nov 28 11:51:48 2023
From: grazulis at ibt.lt (Saulius Gražulis)
Date: Tue, 28 Nov 2023 11:51:48 +0200
Subject: [Cod-bugs] incomplete entries
In-Reply-To:
References:
Message-ID:

Dear Prof. Schmedt,

thank you very much for your e-mail! Below, I will answer your
questions in some detail:

On 2023-11-25 14:33, Schmedt auf der Günne, Jörn, Prof. Dr. wrote:
>
> The reason why I am writing is that I have several questions.
>
> 1.) I am/was writing some tcl-tk-scripts to search for compounds and a
> useful strategy which I used in other databases was to suppress all
> unwanted elements to produce all possible structures in a composition
> space which could be compared with experimental data. For the COD I
> stumbled over a problem.
> Such a search produces a lot of structures
> where the *composition formula is missing.*

Please let us know more precisely how you define "composition formula".
The CIF framework [1], in the DDL1 [2] and DDLm [3] dictionaries,
specifies at least 5 data names for chemical formulae (and 3 names for
molecular weight) [4]. Of these, the COD actively manages only the
'_chemical_formula_sum' data item. Formulae in this data item, if
present, are parsed and standardised, if possible, for all incoming COD
CIFs. In the COD database table `data` [5], it is reported in the
`formula` column; an additional column, `calcformula`, is computed by the
COD software from the atomic coordinates. Both formulae are recorded in
Hill notation [6].

The '_chemical_formula_sum' data item is present in 100% of the COD
records; also, `formula` and `calcformula` values are present in all
rows of the COD `data` SQL table (just checked ;). Maybe you can use
these data items when processing the COD in your software?

Other formula fields of the CIF are not processed and not enforced;
they are only present if the authors of the CIF provided them, and are
given in their original form (i.e. not standardised). Also, if the
_chemical_formula_sum cannot be parsed by our software, it is left as
is; thus some `formula` values might not conform to the Hill syntax
definition. The `calcformula` fields are always in Hill notation, but
they can differ from the formula reported by the authors if hydrogens
are not marked up, disorder is not recorded, or the Z value is missing
in the original file and cannot be estimated reliably.

> I also found a number of entries *with missing fractional coordinates* or
> where only a unit cell was specified but no structure was provided
> (elements marked as "?").
> Is it intended to have such structures in the database or are such
> structures "unwanted"?

Indeed, some of the COD entries lack coordinate data.
The number of such structures is not large (1527 out of 508592, or
about 0.3%). We deliberately keep these structures for several reasons:

- sometimes, we do not have the coordinates just because the structure
  is behind a paywall; in such cases we record that the structure was
  published, with as much information as we can get, and the
  coordinates can be supplemented later (e.g. if the original author
  provides them to the COD);

- some very old structures do not have the coordinates refined or
  published, but the unit cell and space group information might be
  valuable on its own;

- for retracted structures (unfortunately, there were some fraudulent
  structures published in the scientific press), we delete the
  coordinates but keep the cell constants, summary formula and the
  symmetry, with the corresponding notes, to prevent redeposition of
  the same structure by someone else in the future.

You can filter out the structures that do not have coordinates either
by parsing the corresponding CIF file or by querying the COD SQL
database:

> saulius at tasmanijos-velnias collection/ $ mysql -u cod_reader -h sql.crystallography.net cod -e 'select count(*) from data where flags like "%has coordinates%" and (status is null or status not like "%retracted%") and duplicateof is null and optimal is null'
> +----------+
> | count(*) |
> +----------+
> |   501668 |
> +----------+

> I saw that upon submission there is a quality check which I think should
> stop such structures from deposition in the database.

Indeed, the submission scripts check the consistency of the provided
data; in particular, they check that if one of the x, y, z coordinates
is provided, then the remaining two are present as well. But old
publications might give crystal descriptions without coordinates...

> 2.) How can I mark a structure as disordered upon submission?

Disordered parts of the structure should be specified using the
_atom_site_disorder_group and _atom_site_disorder_assembly data items
[7].

> 3.)
> How can I help to revise errors of structures which I find in the
> database which have not been submitted by me?

Please report them to cod-bugs at ibt.lt, with as much information as
possible.

Sincerely yours,
Saulius

Refs.:

[1] IUCr COMCIFS. CIF Core dictionaries. URL:
https://www.iucr.org/resources/cif/dictionaries/cif_core [accessed
2023-11-28T11:12+02:00].

[2] IUCr COMCIFS. CIF Core DDL1 dictionary, v2.4.5 (2014). URL:
https://www.iucr.org/__data/iucr/cif/dictionaries/cif_core_2.4.5.dic
[accessed 2023-11-28T11:06+02:00].

[3] IUCr COMCIFS. CIF Core DDLm dictionary, v3.2.0 (2023). URL:
https://www.iucr.org/__data/iucr/cif/dictionaries/cif_core_3.2.0.dic
[accessed 2023-11-28T11:08+02:00].

[4] CIF data names in the 'CHEMICAL_FORMULA' category:

> saulius at tasmanijos-velnias ~/ $ grep _definition.id cif_core_3.2.0.dic | awk -F"'" '{print $2}' | grep . | grep _chemical_formula'\.'
> _chemical_formula.analytical
> _chemical_formula.IUPAC
> _chemical_formula.moiety
> _chemical_formula.structural
> _chemical_formula.sum
> _chemical_formula.weight
> _chemical_formula.weight_meas
> _chemical_formula.weight_meas_su

URL:
https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Cchemical_formula.html
[accessed 2023-11-28T11:14+02:00].

[5] The COD Team. How to query the COD database. The COD Wiki (2023).
URL: https://wiki.crystallography.net/howtoquerycod/ [accessed
2023-11-28T11:17+02:00].

[6] Hill, E. A. On a system of indexing chemical literature; adopted by
the classification division of the U. S. Patent Office. Journal of the
American Chemical Society, American Chemical Society (ACS), 1900, 22,
478-494, DOI: https://doi.org/10.1021/ja02046a005

[7] IUCr COMCIFS. Core dictionary (coreCIF) version 2.4.5. Atom site
disorder group and assembly data items.
URLs:
https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_disorder_group.html ;
https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_disorder_assembly.html
[accessed 2023-11-28T11:44+02:00].

--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802,
(+370-614)-36366

From kkitahara5101 at gmail.com Tue Nov 28 11:51:32 2023
From: kkitahara5101 at gmail.com (Koichi Kitahara)
Date: Tue, 28 Nov 2023 18:51:32 +0900
Subject: [Cod-bugs] Some structures published in Acta Cryst. E are missing in COD
In-Reply-To: <9e8df34d-d276-4fdb-a572-646f7d1e9758@ibt.lt>
References: <9e8df34d-d276-4fdb-a572-646f7d1e9758@ibt.lt>
Message-ID:

Dear Saulius,

Thank you for re-running the script to add our structures to COD.
I will report any other missing structures that I find.

Best regards,
Koichi Kitahara
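
As an illustration of the Hill notation mentioned in the "incomplete
entries" reply above, here is a minimal Python sketch of the ordering
rule: if carbon is present, C comes first, H second, and the remaining
elements follow alphabetically; without carbon, all elements (including
H) are ordered alphabetically. The function name `hill_formula`, the
dictionary input and the space-separated output are assumptions made
for this example; this is not the COD software, and the sample
compositions are made up.

```python
def hill_formula(counts):
    """Order element counts by the Hill system and return a formula string.

    `counts` maps element symbols to atom counts, e.g. {"C": 2, "H": 6, "O": 1}.
    The space-separated output format is an assumption of this sketch.
    """
    # Element symbols are capitalised, so a plain sort is alphabetical.
    elements = sorted(counts)
    if "C" in counts:
        # Carbon-containing compounds: C first, then H, then the rest.
        rest = [e for e in elements if e not in ("C", "H")]
        ordered = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        # No carbon: strictly alphabetical, hydrogen included.
        ordered = elements

    parts = []
    for element in ordered:
        n = counts[element]
        # A count of 1 is left implicit; ":g" drops a trailing ".0".
        parts.append(element if n == 1 else f"{element}{n:g}")
    return " ".join(parts)


if __name__ == "__main__":
    print(hill_formula({"O": 1, "H": 6, "C": 2}))
    print(hill_formula({"S": 1, "O": 4, "H": 2}))
```

A formula computed this way could then be compared with the `formula`
or `calcformula` values when post-processing query results, keeping in
mind the caveats above about formulae that could not be standardised.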