[Cod-bugs] 2997 invalid files in C.O.D.
Saulius Gražulis
grazulis at ibt.lt
Sun Jul 9 11:57:43 EEST 2023
Dear David,
please let me highlight one more feature of the COD records that I
forgot to include in yesterday's letter:
On 2023-07-05 12:59, David Palmer wrote:
> In the meantime, we have used our automated tools to analyse all
> current structures files. I am attaching a summary, listing file IDs
> and errors for 2,997 out of your 0.5M or so files: a relatively-small
> figure (ca. 0.6%). However, these files are invalid, and cannot be
> used for structural work, so I would recommend getting them fixed.
>
> The most common errors are:
>
> - missing fractional coordinates
For so-called 'dummy' atoms, the x, y and z fractional coordinates can
be given as the special value '.' [1]. There are
several examples of such COD entries in your list (file "Error Files
from COD (2023-07-04).txt"). For example, the coordinate section of the
COD 1001614 entry contains the following:
> loop_
> _atom_site_label
> # ... other data names omitted for brevity
> _atom_site_calc_flag
> # ... regular atom sites omitted
> H1 H1+ 4 e . . . 1 0 dum
> H2 H1+ 4 e . . . 0.8 0 dum
Likewise, the COD 1010499 entry contains:
> loop_
> _atom_site_label
> # ... other data names omitted for brevity
> _atom_site_calc_flag
> Hg1 Hg2+ 8 d 0.25 0.21 0.125 1. 0 d
> C1 C2+ 16 ? . . . 1 0 dum
> N1 N3- 16 ? . . . 1 0 dum
The atomic sites are marked as 'dum' in accordance with the IUCr
specification [1]. The IUCr does not prescribe any specific
interpretation for these lines, but we in the COD use the following
conventions:
- an atom whose symbol is that of an element in the periodic table (like
the 'H', 'C' or 'N' in the examples above) is considered to exist
somewhere in the unit cell, but its coordinates are not
determined.
Thus, in the examples above, the unit cell of 1001614 contains 1.8 x 4
extra electrons (and protons) per unit cell, on average, but we do not
/know/ where these atoms are located, not even the atom to which the
hydrogens are attached. The rest of the data specified for these sites
are all relevant: we need to take the multiplicity (4 in this case)
into account, and the Wyckoff letter tells us that we assume the
hydrogens are on general positions. The hydrogens carry a (+1) formal
charge. This allows us to check the electric neutrality of the cell,
provides corrections for F000 and makes it possible to calculate the
chemical formula. Your software may use this information for determining
Fcalc if you find it necessary.
Likewise, entry 1010499 reports Hg atoms on special positions with
specified coordinates, and the remaining "light" atoms C and N with
undetermined coordinates. I interpret this record as follows: we know
that mercury cyanide must contain carbon and nitrogen
(the formula is Hg(CN)2). The structure, however, was determined in 1926
(!), and with the technology of that time it is very likely that the
researchers did not "see" the carbon and the nitrogen positions (getting
Hg positions was already a feat!). Thus, we can calculate the total
number of electrons *and* the positions of Hg, but the locations of
lighter atoms need to be approximated or obtained by other means.
There are no errors in these entries; they faithfully represent the
publications cited in their metadata and reflect the knowledge
available at the time.
- if a dummy atom has a label/chemical symbol that is /not/ in the
periodic table of elements, then we should assume that the site is
introduced for convenience purposes only (e.g. to measure distances in
some software); these atoms should be excluded from structure factor
calculations, even if they have coordinates.
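The two conventions above could be sketched in Python roughly as follows. This is an illustration only, not COD code: the function name classify_site, the returned labels, and the deliberately truncated KNOWN_SYMBOLS table are all assumptions made for the example.

```python
# Sketch only: names, labels and the truncated symbol table are
# assumptions for illustration, not part of any COD tool.
import re

# Truncated on purpose; a real table would list every element.
KNOWN_SYMBOLS = {"H", "C", "N", "O", "Hg"}

def classify_site(type_symbol, calc_flag):
    """Apply the two dummy-atom conventions to one atom site."""
    if calc_flag != "dum":
        return "regular"
    # Drop the oxidation state suffix, e.g. 'H1+' -> 'H', 'N3-' -> 'N'.
    match = re.match(r"[A-Za-z]+", type_symbol)
    symbol = match.group(0).capitalize() if match else ""
    if symbol in KNOWN_SYMBOLS:
        # A real atom: counts towards the formula and the cell contents.
        return "real atom, coordinates undetermined"
    # Not an element: a helper site, skipped in structure factors.
    return "convenience site, exclude from structure factors"
```

With this sketch, the 'H1+' and 'N3-' dummy sites from the excerpts above fall into the first category, while a site with a non-element symbol falls into the second.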
NB: if the hydrogens do not have modelled coordinates but the
publication provides clear evidence to which heavy atoms these hydrogens
are attached, we indicate this by setting _atom_site_attached_hydrogens
of the site to a number greater than 0, and no dummy atoms are used in
this case.
NB: unlike some other databases that could set occupancies of dummy
hydrogen sites to more than 1 (e.g. set them to 4 to indicate four
hydrogen atoms with unknown locations), we never use occupancies larger
than 1.0. This makes COD CIFs /valid/ with respect to the current IUCr
dictionaries. To specify more than one hydrogen atom, we specify several
dummy sites with occupancies <= 1.0 for each such site, as you see in
the 1001614 example above. To get the additional electron count you would
need to sum up occupancies of all such sites (times the multiplicity of
their positions, of course).
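As a small worked example of that sum, here is a Python sketch for the two dummy hydrogen sites of COD 1001614. The function name and the (multiplicity, occupancy) tuple layout are assumptions for illustration; the values are read off the two 'dum' rows quoted above, taking the third column as the multiplicity and the eighth as the occupancy.

```python
# Sketch only: the function name and tuple layout are assumptions;
# the numbers come from the two 'dum' rows of COD 1001614.

def dummy_atoms_per_cell(sites):
    """Sum occupancy times multiplicity over the dummy sites."""
    return sum(multiplicity * occupancy for multiplicity, occupancy in sites)

# H1: multiplicity 4, occupancy 1; H2: multiplicity 4, occupancy 0.8
sites_1001614 = [(4, 1.0), (4, 0.8)]
total = dummy_atoms_per_cell(sites_1001614)
print(round(total, 3))  # the 1.8 x 4 figure mentioned in the text
```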
I hope this clarifies the COD content policy.
Sincerely yours,
Saulius
Refs.:
[1] IUCr Core dictionary (coreCIF) version 2.4.5 _atom_site_calc_flag
(2023)
https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_calc_flag.html
[accessed 2023-07-09T11:24+03:00]
--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366