<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Dear David,</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">please let me highlight one more
feature of the COD records that I forgot to mention in
yesterday's letter:<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 2023-07-05 12:59, David Palmer
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:E014914B-BB33-43BF-9AB6-9F1AB0342735@crystalmaker.com">
<div>In the meantime, we have used our automated tools to analyse
all current structures files. I am attaching a summary, listing
file IDs and errors for 2,997 out of your 0.5M or so files: a
relatively-small figure (ca. 0.6%). However, these files are
invalid, and cannot be used for structural work, so I would
recommend getting them fixed.</div>
<div><br>
</div>
<div>The most common errors are:</div>
<div><br>
</div>
<div>- missing fractional coordinates</div>
</blockquote>
<p>Fractional coordinates may be given as the special value '.'
for x, y and z in the case of so-called 'dummy' atoms
[1]. There are several examples of such COD entries in your list
(file "Error Files from COD (2023-07-04).txt"). For example, the
coordinate section of COD entry 1001614 contains the
following:</p>
<p>
<blockquote type="cite"><font face="monospace">loop_<br>
_atom_site_label<br>
# ... other data names omitted for brevity<br>
_atom_site_calc_flag<br>
# ... regular atom sites omitted<br>
H1 H1+ 4 e . . . 1 0 dum<br>
H2 H1+ 4 e . . . 0.8 0 dum</font><br>
</blockquote>
</p>
<p>Likewise, the COD 1010499 entry contains:</p>
<p>
<blockquote type="cite"><font face="monospace">loop_<br>
_atom_site_label<br>
# ... other data names omitted for brevity<br>
_atom_site_calc_flag<br>
Hg1 Hg2+ 8 d 0.25 0.21 0.125 1. 0 d<br>
C1 C2+ 16 ? . . . 1 0 dum<br>
N1 N3- 16 ? . . . 1 0 dum</font><br>
</blockquote>
</p>
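<p>As a rough illustration, rows like the ones quoted above could be
screened for sites without a determined position along these lines.
This is only a sketch: the column order (label, type symbol,
multiplicity, Wyckoff letter, x, y, z, occupancy, one further column,
calc flag) is an assumption read off the quoted rows, because the data
names are abbreviated in this letter, and all function names are
hypothetical. Real files should be read with a proper CIF parser.</p>

```python
from typing import NamedTuple, Optional, Tuple


class Site(NamedTuple):
    label: str
    type_symbol: str
    multiplicity: int
    wyckoff: str
    xyz: Tuple[Optional[float], Optional[float], Optional[float]]
    occupancy: float
    calc_flag: str


def parse_value(token: str) -> Optional[float]:
    """Map the CIF special values '.' and '?' to None, anything else to float."""
    return None if token in (".", "?") else float(token)


def parse_site(row: str) -> Site:
    # ASSUMED column order (see the note above); values carrying a standard
    # uncertainty, such as 0.25(1), are not handled by this sketch.
    f = row.split()
    x, y, z = (parse_value(t) for t in f[4:7])
    return Site(f[0], f[1], int(f[2]), f[3], (x, y, z), float(f[7]), f[9])


def has_undetermined_position(site: Site) -> bool:
    """True for a 'dum' site whose three coordinates are all unspecified."""
    return site.calc_flag == "dum" and all(c is None for c in site.xyz)


# Rows copied from the 1010499 excerpt quoted above.
ROWS_1010499 = [
    "Hg1 Hg2+ 8 d 0.25 0.21 0.125 1. 0 d",
    "C1 C2+ 16 ? . . . 1 0 dum",
    "N1 N3- 16 ? . . . 1 0 dum",
]
unlocated = [s.label for s in map(parse_site, ROWS_1010499)
             if has_undetermined_position(s)]
print(unlocated)
```

<p>For the 1010499 rows this selects C1 and N1 and leaves Hg1 alone,
so such sites can be treated separately instead of being reported as
missing coordinates.</p>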
<p>The atomic sites are marked as 'dum' in accordance with the IUCr
specification [1]. The IUCr does not prescribe any specific
interpretation for these lines, but we in the COD use the
following conventions:</p>
<p>- a dummy atom whose symbol is a real element of the periodic
table (like the 'H', 'C' or 'N' in the examples above) is
considered to exist somewhere in the unit cell, but its
coordinates are not determined. <br>
</p>
<p>Thus, in the examples above, the unit cell of 1001614 contains
1.8 x 4 extra electrons (and protons) per unit cell, on average,
but we do not <i>know</i> where these atoms are located, not even
the atom to which the hydrogens are attached. The rest of the data
specified for these sites is still relevant: we need to
take the multiplicity (4 in this case) into account, and the Wyckoff
letter tells us that we assume the hydrogens are on general
positions. The hydrogens carry a (+1) formal charge. This allows
us to check the electric neutrality of the cell, provides
corrections for F000 and makes it possible to calculate the
chemical formula. Your software may use this information for
determining Fcalc if you find it necessary.<br>
</p>
<p>Likewise, entry 1010499 reports Hg atoms on special positions with
specified coordinates, and the remaining "light" atoms C and N
with undetermined coordinates. I interpret this record as follows:
we know that mercury cyanide must contain carbon and
nitrogen (the formula is Hg(CN)2). The structure, however,
was determined in 1926 (!), and with the technology of that time it
is very likely that the researchers did not "see" the carbon and
the nitrogen positions (getting Hg positions was already a feat!).
Thus, we can calculate the total number of electrons *and* the
positions of Hg, but the locations of lighter atoms need to be
approximated or obtained by other means. <br>
</p>
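<p>The bookkeeping described above (cell contents and electric
neutrality) can be sketched for the 1010499 rows as follows. The way
the type symbol is split into element and formal charge is an
assumption based on symbols such as 'Hg2+' and 'N3-' in the excerpt,
and the function names are hypothetical; this is not the code used by
the COD.</p>

```python
import re
from collections import Counter

# (type symbol, multiplicity, occupancy), copied from the 1010499 rows above.
SITES_1010499 = [("Hg2+", 8, 1.0), ("C2+", 16, 1.0), ("N3-", 16, 1.0)]


def split_type_symbol(symbol):
    """Split e.g. 'Hg2+' into ('Hg', 2); a symbol without a sign gets charge 0."""
    match = re.fullmatch(r"([A-Za-z]+)(\d*)([+-]?)", symbol)
    if match is None:
        raise ValueError(f"unrecognised type symbol: {symbol!r}")
    element, digits, sign = match.groups()
    if not sign:
        return element, 0
    magnitude = int(digits) if digits else 1
    return element, magnitude if sign == "+" else -magnitude


def cell_contents(sites):
    """Return (atoms per cell by element, net formal charge per cell)."""
    counts = Counter()
    net_charge = 0.0
    for symbol, multiplicity, occupancy in sites:
        element, charge = split_type_symbol(symbol)
        atoms = multiplicity * occupancy  # dummy sites count like any other
        counts[element] += atoms
        net_charge += atoms * charge
    return counts, net_charge


counts, net_charge = cell_contents(SITES_1010499)
print(dict(counts), net_charge)
```

<p>With the quoted values this gives 8 Hg, 16 C and 16 N per cell and
a net charge of zero, i.e. the 1:2:2 ratio of Hg(CN)2, even though
the C and N positions are unknown.</p>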
<p>There are no errors in these entries; they faithfully represent
the publications cited in their metadata and reflect the
knowledge available at the time.</p>
<p>- if a dummy atom has a label/chemical symbol that is <i>not</i>
in the periodic table of elements, then we should assume that the
site is introduced for convenience purposes only (e.g. to measure
distances in some software); these atoms should be excluded from
structure factor calculations, even if they have coordinates.</p>
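<p>The two conventions for dummy sites can be summarised in a small
classifier. This is a sketch only: the category names, the function
names and the example symbol 'Cg' are hypothetical, the element is
taken from the leading letters of the type symbol (an assumption),
and isotope labels such as 'D' are deliberately not in the list.</p>

```python
import re

# The 118 element symbols of the periodic table.
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni
Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I
Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt
Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr
Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())


def element_of(type_symbol):
    """Return the leading letters of a type symbol, e.g. 'Hg' for 'Hg2+'."""
    match = re.match(r"[A-Za-z]+", type_symbol)
    return match.group(0).capitalize() if match else ""


def classify_site(type_symbol, calc_flag):
    """Classify a site following the two conventions described above."""
    if calc_flag != "dum":
        return "modelled"
    if element_of(type_symbol) in ELEMENTS:
        # Real atom with unknown position: keep it for the formula, the
        # charge balance and F000.
        return "unlocated_atom"
    # Convenience site: exclude from structure factor calculations,
    # even if coordinates are given.
    return "convenience_site"


print(classify_site("H1+", "dum"))
print(classify_site("Cg", "dum"))
print(classify_site("Hg2+", "d"))
```

<p>Only the "convenience_site" category should be dropped from
structure factor calculations; "unlocated_atom" sites still
contribute to the cell contents.</p>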
<p>NB: if the hydrogens do not have modelled coordinates but the
publication provides clear evidence to which heavy atoms these
hydrogens are attached, we indicate this by setting
_atom_site_attached_hydrogens of the site to a number greater than 0,
and no dummy atoms are used in this case.</p>
<p>NB: unlike some other databases that could set occupancies of
dummy hydrogen sites to more than 1 (e.g. set them to 4 to
indicate four hydrogen atoms with unknown locations), we never use
occupancies larger than 1.0. This makes COD CIFs <i>valid</i>
with respect to the current IUCr dictionaries. To specify more
than one hydrogen atom, we specify several dummy sites with
occupancies &lt;= 1.0 for each such site, as you see in the
1001614 example above. To get the additional electron count you
would need to sum up occupancies of all such sites (times the
multiplicity of their positions, of course).<br>
</p>
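<p>As a minimal sketch of this summation, using the two dummy
hydrogen sites of 1001614 quoted earlier (the function name is
hypothetical, and the occupancy check simply mirrors the convention
stated above):</p>

```python
# (multiplicity, occupancy) of the dummy sites H1 and H2 in COD 1001614.
DUMMY_H_SITES_1001614 = [(4, 1.0), (4, 0.8)]


def atoms_per_cell(sites):
    """Sum multiplicity * occupancy over sites, rejecting occupancies above 1."""
    total = 0.0
    for multiplicity, occupancy in sites:
        if occupancy > 1.0:
            raise ValueError(f"occupancy {occupancy} exceeds 1.0")
        total += multiplicity * occupancy
    return total


hydrogens = atoms_per_cell(DUMMY_H_SITES_1001614)
print(round(hydrogens, 6))
```

<p>This reproduces the 1.8 x 4 = 7.2 unlocated hydrogens per unit
cell mentioned above.</p>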
<p>I hope this clarifies the COD policy on such content.</p>
<p>Sincerely yours,<br>
Saulius<br>
</p>
<p>Refs.:</p>
<p>[1] IUCr Core dictionary (coreCIF) version 2.4.5
_atom_site_calc_flag (2023)
<a class="moz-txt-link-freetext" href="https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_calc_flag.html">https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_calc_flag.html</a>
[accessed 2023-07-09T11:24+03:00]<br>
</p>
<pre class="moz-signature" cols="72">--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366
</pre>
</body>
</html>