[Cod-bugs] incomplete entries

Saulius Gražulis grazulis at ibt.lt
Tue Nov 28 11:51:48 EET 2023


Dear Prof. Schmedt,

thank you very much for your e-mail! Below, I will answer your questions 
in some detail:

On 2023-11-25 14:33, Schmedt auf der Günne, Jörn, Prof. Dr. wrote:
> 
> The reason why I am writing is that I have several questions.
> 
> 1.) I am/was writing some tcl-tk-scripts to search for compounds and a 
> useful strategy which I used in other databases was to suppress all 
> unwanted elements to produce all possible structures in a composition 
> space which could be compared with experimental data. For the COD I 
> stumbled over a problem. Such a search produces a lot of structures 
> where the *composition formula is missing*.

Please let us know more precisely how you define "composition formula".

The CIF framework [1] specifies, in the DDL1 [2] and DDLm [3]
dictionaries, at least 5 data names for chemical formulae (and 3 names
for molecular weight) [4].

Of these, the COD actively manages only the '_chemical_formula_sum' data 
item. Formulae in this data item, if present, are parsed and 
standardised, if possible, for all incoming COD CIFs. In the COD 
database table `data` [5], it is reported in the `formula` column; 
additional column `calcformula` is computed by the COD software from 
atomic coordinates. Both formulae are recorded in Hill notation [6].
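
To illustrate the Hill ordering mentioned above, here is a minimal Python sketch (not the COD's own code). It assumes the space-separated style of '_chemical_formula_sum', with a count of 1 left out; the function name `hill_formula` is invented for this example.

```python
def hill_formula(counts):
    """Format an {element: count} mapping as a Hill-ordered formula string."""
    elements = sorted(counts)
    if "C" in counts:
        # With carbon present: C first, H second, the rest alphabetically.
        rest = [e for e in elements if e not in ("C", "H")]
        elements = ["C"] + (["H"] if "H" in counts else []) + rest
    # Without carbon, all elements (including H) stay in alphabetical order.
    parts = []
    for element in elements:
        n = counts[element]
        # A count of 1 is omitted; other counts are printed compactly.
        parts.append(element if n == 1 else f"{element}{n:g}")
    return " ".join(parts)


print(hill_formula({"O": 6, "H": 12, "C": 6}))
print(hill_formula({"S": 1, "O": 4, "H": 2}))
```

Note that hydrogen is only moved to the front when carbon is present; in a carbon-free compound it is sorted alphabetically like any other element.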

The '_chemical_formula_sum' data item is present in 100% of the COD
records; also, `formula` and `calcformula` values are present in all
rows in the COD `data` SQL table (just checked ;). Maybe you can use
these data items when processing the COD in your software?

Other formula fields of the CIF are not processed and not enforced; they
are only present if the authors of the CIF provided them, and are given
in their original form (i.e. not standardised).

Also, if the _chemical_formula_sum cannot be parsed by our software, it
is left as is; thus some `formula` values might not conform to the Hill
syntax definition. The `calcformula` values are always in Hill notation,
but they can differ from the formula reported by the authors if
hydrogens are not marked up, disorder is not recorded, or the Z value is
missing in the original file and cannot be estimated reliably.
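
For the composition-space search you describe, a client-side filter could look roughly like the Python sketch below. It is only an illustration under assumptions: it expects space-separated element/count tokens, optionally wrapped in "-" delimiters, and it returns None for anything it cannot parse, so that non-conformant formulae can be reviewed separately. The names `parse_formula` and `in_composition_space` and the sample strings are made up for this example.

```python
import re

# One token is an element symbol followed by an optional (possibly
# fractional) count, e.g. "C6", "H", "Fe2.5".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(text):
    """Return {element: count}, or None if the text cannot be parsed."""
    tokens = text.strip().strip("-").split()
    if not tokens:
        return None
    counts = {}
    for token in tokens:
        match = _TOKEN.fullmatch(token)
        if match is None:
            return None  # non-conformant formula; leave it for manual review
        element, number = match.groups()
        counts[element] = counts.get(element, 0.0) + float(number or 1)
    return counts


def in_composition_space(text, allowed):
    """True if the formula parses and contains only allowed elements."""
    counts = parse_formula(text)
    return counts is not None and set(counts) <= set(allowed)


print(in_composition_space("- Li2 O -", {"Li", "O", "P"}))
print(in_composition_space("- Fe Li O4 P -", {"Li", "O", "P"}))
```

Treating unparseable formulae as "unknown" rather than as "no match" keeps such entries from being dropped silently.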

> I also found a number of entries *with missing fractional coordinates* or
> where only a unit cell was specified but no structure was provided
> (elements marked as "?").

> Is it intended to have such structures in the database or are such
> structures "unwanted"?

Indeed, some of the COD entries lack coordinate data. The number of
such structures is not large (1527 out of 508592, or about 0.3%).

We definitely keep these structures for several reasons:

- sometimes, we do not have the coordinates just because the structure
is behind a paywall; in such cases we record that the structure was
published, with as much information as we can get, and the coordinates
can be supplemented later (e.g. if the original author provides them to
the COD);

- some very old structures do not have the coordinates refined or 
published, but the unit cell and space group information might be 
valuable on its own;

- for retracted structures (unfortunately, some fraudulent structures
have been published in the scientific press), we delete the coordinates
but keep the cell constants, summary formula and the symmetry, with the
corresponding notes, to prevent redeposition of the same structure by
someone else in the future.

You can filter out the structures that do not have coordinates either by 
parsing the corresponding CIF file or by querying the COD SQL database:

> saulius at tasmanijos-velnias collection/ $ mysql -u cod_reader -h sql.crystallography.net cod -e 'select count(*) from data where flags like "%has coordinates%" and (status is null or status not like "%retracted%") and duplicateof is null and optimal is null'
> +----------+
> | count(*) |
> +----------+
> |   501668 |
> +----------+
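
To show what each condition of the query above does, here is a small Python sketch that applies the same WHERE clause to an in-memory SQLite table. The rows and their values are invented for the illustration, and I assume here that the `file` column holds the COD number; the real server is MySQL, so this only demonstrates the filter logic.

```python
import sqlite3

# Same conditions as in the mysql query above, selecting IDs instead of a count.
FILTER = """
    SELECT file FROM data
    WHERE flags LIKE '%has coordinates%'
      AND (status IS NULL OR status NOT LIKE '%retracted%')
      AND duplicateof IS NULL
      AND optimal IS NULL
    ORDER BY file
"""


def usable_entries(rows):
    """Return the IDs of rows that pass the filter (made-up local table)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE data (file INTEGER, flags TEXT, status TEXT,"
        " duplicateof INTEGER, optimal INTEGER)"
    )
    con.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?)", rows)
    result = [row[0] for row in con.execute(FILTER)]
    con.close()
    return result


# Hypothetical rows: (file, flags, status, duplicateof, optimal)
sample = [
    (1, "has coordinates", None, None, None),         # kept
    (2, "", None, None, None),                        # no coordinates
    (3, "has coordinates", "retracted", None, None),  # retracted
    (4, "has coordinates", None, 1, None),            # duplicate of entry 1
    (5, "has coordinates", "warnings", None, None),   # kept
    (6, "has coordinates", None, None, 1),            # better version exists
]
print(usable_entries(sample))
```

The explicit `status IS NULL OR ...` is needed because `NULL NOT LIKE '...'` evaluates to NULL, which would otherwise exclude every entry that has no status at all.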

> I saw that upon submission there is a quality check which I think should 
> stop such structures from deposition in the database.

Indeed, the submission scripts check the consistency of the provided 
data; in particular, they check that if one of the x,y,z coordinates is 
provided, then the remaining two are present as well. But the old 
publications might give crystal descriptions without coordinates...
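
The all-or-nothing rule for coordinates can be sketched in Python as follows. This is not the actual COD submission code, just an illustration of the check; it assumes the coordinates arrive as raw CIF strings, with "?" and "." standing for unknown or inapplicable values, and the function name `coordinate_status` is invented.

```python
# CIF placeholders for unknown ("?") and inapplicable (".") values;
# an empty string is treated as missing as well.
MISSING = {"?", ".", ""}


def coordinate_status(x, y, z):
    """Classify one atom site as 'complete', 'absent' or 'inconsistent'."""
    present = [value.strip() not in MISSING for value in (x, y, z)]
    if all(present):
        return "complete"
    if not any(present):
        return "absent"
    # Some, but not all, of x, y, z are given.
    return "inconsistent"


print(coordinate_status("0.1234(5)", "0.25", "0.5"))
print(coordinate_status("?", "?", "?"))
print(coordinate_status("0.1234(5)", "?", "0.5"))
```

Only the "inconsistent" case signals a data error; "absent" corresponds to the entries without coordinates discussed above.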

> 2.) How can I mark a structure as disordered upon submission?

Disordered parts of the structure should be specified using 
_atom_site_disorder_group and _atom_site_disorder_assembly data items [7].
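
As an illustration of how these data items appear in a CIF, the Python sketch below prints a small atom site loop. The atom labels, occupancies and the helper name `disorder_loop` are made up; a real file would of course also carry the coordinates and the other atom site items.

```python
def disorder_loop(atoms):
    """Build a CIF loop from (label, occupancy, assembly, group) tuples."""
    lines = [
        "loop_",
        "_atom_site_label",
        "_atom_site_occupancy",
        "_atom_site_disorder_assembly",
        "_atom_site_disorder_group",
    ]
    for label, occupancy, assembly, group in atoms:
        # Ordered atoms get "." (inapplicable) in both disorder columns.
        assembly_text = "." if assembly is None else assembly
        group_text = "." if group is None else group
        lines.append(f"{label} {occupancy} {assembly_text} {group_text}")
    return "\n".join(lines)


# Hypothetical atoms: C1 is ordered; C2A and C2B are two alternative
# positions of the same atom, sharing assembly A with groups 1 and 2.
print(disorder_loop([
    ("C1", 1.0, None, None),
    ("C2A", 0.6, "A", 1),
    ("C2B", 0.4, "A", 2),
]))
```

Atoms that share an assembly but have different group numbers are alternatives of one another, so their occupancies would normally add up to 1.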

> 3.) How can I help to revise errors of structures which I find in the 
> database which have not been submitted by me?

Please report them to cod-bugs at ibt.lt, with as much information as
possible.

Sincerely yours,
Saulius

Refs.:

[1] IUCr COMCIFS. CIF Core dictionaries. URL: 
https://www.iucr.org/resources/cif/dictionaries/cif_core [accessed 
2023-11-28T11:12+02:00].

[2] IUCr COMCIFS. CIF Core DDL1 dictionary, v2.4.5 (2014). URL: 
https://www.iucr.org/__data/iucr/cif/dictionaries/cif_core_2.4.5.dic 
[accessed 2023-11-28T11:06+02:00].

[3] IUCr COMCIFS. CIF Core DDLm dictionary, v3.2.0 (2023). URL: 
https://www.iucr.org/__data/iucr/cif/dictionaries/cif_core_3.2.0.dic 
[accessed 2023-11-28T11:08+02:00].

[4] CIF Data names in the 'CHEMICAL_FORMULA' category:

> saulius at tasmanijos-velnias ~/ $ grep _definition.id cif_core_3.2.0.dic | awk -F"'" '{print $2}' | grep . | grep _chemical_formula'\.'
> _chemical_formula.analytical
> _chemical_formula.IUPAC
> _chemical_formula.moiety
> _chemical_formula.structural
> _chemical_formula.sum
> _chemical_formula.weight
> _chemical_formula.weight_meas
> _chemical_formula.weight_meas_su

URL: 
https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Cchemical_formula.html 
[accessed 2023-11-28T11:14+02:00].

[5] The COD Team. How to query the COD database. The COD Wiki (2023). 
URL: https://wiki.crystallography.net/howtoquerycod/ [accessed 
2023-11-28T11:17+02:00].

[6] Hill, E. A. On a system of indexing chemical literature; adopted by 
the classification division of the U. S. Patent Office. Journal of the 
American Chemical Society, American Chemical Society (ACS), 1900, 22, 
478-494, DOI: https://doi.org/10.1021/ja02046a005

[7] IUCr COMCIFS. Core dictionary (coreCIF) version 2.4.5. Atom site 
disorder group and assembly data items. URLs: 
https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_disorder_group.html 
; 
https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_disorder_assembly.html 
[accessed 2023-11-28T11:44+02:00].

-- 
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366




