[Cod-bugs] 2997 invalid files in C.O.D.

Saulius Gražulis grazulis at ibt.lt
Sun Jul 9 11:57:43 EEST 2023


Dear David,

Please let me highlight one more feature of the COD records that I 
forgot to include in yesterday's letter:

On 2023-07-05 12:59, David Palmer wrote:
> In the meantime, we have used our automated tools to analyse all 
> current structures files. I am attaching a summary, listing file IDs 
> and errors for 2,997 out of your 0.5M or so files: a relatively-small 
> figure (ca. 0.6%). However, these files are invalid, and cannot be 
> used for structural work, so I would recommend getting them fixed.
>
> The most common errors are:
>
> - missing fractional coordinates

Fractional coordinates may be given as the special value '.' for the 
x, y and z coordinates in the case of so-called 'dummy' atoms [1]. There 
are several examples of such COD entries in your list (file "Error Files 
from COD (2023-07-04).txt"). For example, the coordinate section of the 
COD 1001614 entry contains the following:

> loop_
> _atom_site_label
> # ... other data names omitted for brevity
> _atom_site_calc_flag
> # ... regular atom sites omitted
> H1 H1+ 4 e . . . 1 0 dum
> H2 H1+ 4 e . . . 0.8 0 dum

Likewise, the COD 1010499 entry contains:

> loop_
> _atom_site_label
> # ... other data names omitted for brevity
> _atom_site_calc_flag
> Hg1 Hg2+ 8 d 0.25 0.21 0.125 1. 0 d
> C1 C2+ 16 ? . . . 1 0 dum
> N1 N3- 16 ? . . . 1 0 dum

The atomic sites are marked as 'dum' in accordance with the IUCr 
specification [1]. The IUCr does not prescribe any specific 
interpretation for these lines, but we in the COD use the following 
conventions:

- an atom with an existing atomic symbol from the periodic table (like 
the 'H', 'C' or 'N' in the examples above) is considered to exist 
somewhere in the unit cell, but the coordinates of the atom are not 
determined.

Thus, in the examples above, the unit cell of 1001614 contains 1.8 x 4 
= 7.2 extra electrons (and protons) per unit cell, on average, but we do 
not /know/ where these atoms are located, not even the atom to which the 
hydrogens are attached. The rest of the data specified for these sites 
are all relevant: we need to take the multiplicity (4 in this case) 
into account, and the Wyckoff letter tells us that we assume the 
hydrogens are on general positions. The hydrogens carry a (+1) formal 
charge. This allows us to check the electric neutrality of the cell, 
provides corrections for F000 and makes it possible to calculate the 
chemical formula. Your software may use this information for determining 
Fcalc if you find it necessary.

Likewise, entry 1010499 reports Hg atoms on special positions with 
specified coordinates, and the remaining "light" atoms C and N with 
undetermined coordinates. I interpret this record as follows: we know 
that for mercury cyanide we need to have carbon and nitrogen present 
(the formula is Hg(CN)2). The structure, however, was determined in 1926 
(!), and with the technologies of that time it is very likely that the 
researchers did not "see" the carbon and the nitrogen positions (getting 
Hg positions was already a feat!). Thus, we can calculate the total 
number of electrons *and* the positions of Hg, but the locations of 
lighter atoms need to be approximated or obtained by other means.

There are no errors in these entries; they faithfully represent the 
publications cited in their metadata and give the knowledge available 
at that point.

- if a dummy atom has a label/chemical symbol that is /not/ in the 
periodic table of elements, then we should assume that the site is 
introduced for convenience purposes only (e.g. to measure distances in 
some software); these atoms should be excluded from structure factor 
calculations, even if they have coordinates.
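
The two conventions above could be sketched in Python roughly as follows. 
This is only an illustration, not COD code: the function names and the 
returned strings are hypothetical, and extracting the element by taking 
the leading letters of the type symbol (so that 'H1+' gives 'H') is my 
assumption about how a reader might implement the check.

```python
import re

# Symbols of the 118 named elements of the periodic table.
ELEMENT_SYMBOLS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni
Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I
Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt
Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr
Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())


def element_of(type_symbol):
    """Return the element for a symbol like 'H1+' or 'Hg2+', else None.

    Assumption: the element is the leading run of letters of the symbol.
    """
    match = re.match(r"[A-Za-z]+", type_symbol)
    if match is None:
        return None
    symbol = match.group(0).capitalize()
    return symbol if symbol in ELEMENT_SYMBOLS else None


def classify_site(type_symbol, calc_flag):
    """Classify one atom site according to the conventions described above."""
    if calc_flag != "dum":
        return "regular"
    if element_of(type_symbol) is not None:
        # A real atom that exists in the cell; its coordinates are unknown.
        return "atom with undetermined position"
    # Not an element: a convenience site, to be left out of structure factors.
    return "convenience site"


print(classify_site("H1+", "dum"))
print(classify_site("Hg2+", "d"))
print(classify_site("Cg1", "dum"))
```

Here 'Cg1' is a made-up example of a non-element label, such as one a 
program might use for a ring centroid.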

NB: if the hydrogens do not have modelled coordinates but the 
publication provides clear evidence of which heavy atoms these hydrogens 
are attached to, we indicate this by setting _atom_site_attached_hydrogens 
of the site to a number greater than 0, and no dummy atoms are used in 
this case.

NB: unlike some other databases that could set occupancies of dummy 
hydrogen sites to more than 1 (e.g. set them to 4 to indicate four 
hydrogen atoms with unknown locations), we never use occupancies larger 
than 1.0. This makes COD CIFs /valid/ with respect to the current IUCr 
dictionaries. To specify more than one hydrogen atom, we specify several 
dummy sites with occupancies <= 1.0 for each such site, as you see in 
the 1001614 example above. To get the additional electron count you 
would need to sum up the occupancies of all such sites (times the 
multiplicity of their positions, of course).
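
As a small worked example of that sum, here is a Python sketch using the 
two dummy hydrogen sites of the 1001614 example (occupancies 1 and 0.8, 
multiplicity 4 each). The function name and the (occupancy, multiplicity) 
tuple layout are hypothetical, chosen only for this illustration.

```python
def unlocated_atoms_per_cell(dummy_sites):
    """Sum occupancy times multiplicity over the given dummy atom sites.

    dummy_sites: iterable of (occupancy, multiplicity) pairs, one pair
    per dummy site of the same element.
    """
    return sum(occupancy * multiplicity
               for occupancy, multiplicity in dummy_sites)


# Dummy sites H1 and H2 of COD 1001614, as (occupancy, multiplicity).
hydrogen_sites = [(1.0, 4), (0.8, 4)]

count = unlocated_atoms_per_cell(hydrogen_sites)
print(round(count, 3))  # (1 + 0.8) x 4 hydrogens per unit cell
```

For hydrogen this count is also the number of extra protons per cell; for 
heavier elements one would multiply by the atomic number.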

I hope this clarifies the COD policy on such content.

Sincerely yours,
Saulius

Refs.:

[1] IUCr Core dictionary (coreCIF) version 2.4.5 _atom_site_calc_flag 
(2023) 
https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_calc_flag.html 
[accessed 2023-07-09T11:24+03:00]

-- 
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366
