<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Dear David,</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">thank you for your e-mail and the list
of issues that you have provided.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">The feedback from the COD users, and of
course that includes your feedback, is very valuable to us. We do
our best to correct the COD entries if there are errors in them and
to make the COD as accurate as possible. In doing so we adhere strictly
to the definitions of the CIF provided by the IUCr and to the best
current practices we are aware of in crystallography. Sometimes,
however, it is not possible to make all the corrections that our users
request. Below, I give my comments on the issues you raise.</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 2023-07-05 12:59, David Palmer
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:E014914B-BB33-43BF-9AB6-9F1AB0342735@crystalmaker.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
Dear Colleagues,
<div><br>
</div>
<div>I send you a message a few weeks ago about my plans to
provide easy phase ID via C.O.D.-hosted structures. I haven’t
heard back from you, so I assume you have no objections.</div>
</blockquote>
<p>I must admit that we did not receive your previous mail; it is
possible that the e-mail was lost on the way, since we had some
mail server failures at our university. In any case, from your
current letter I understand that you would like to provide
material identification software based on the COD and make it
public. If this is so, then we have absolutely no objections to
that; in fact, the COD exists to make such projects possible! Of
course, please advise your users to cite the original
publications that produced the data records in the COD if specific
records are used, as is customary in scientific practice, and we
would appreciate citation and reference of the COD as well, where
relevant.</p>
<p>As a side note, we never abbreviate our database as the 'C.O.D.'
(with periods); it is usually written as an initialism 'COD'.<br>
</p>
<blockquote type="cite"
cite="mid:E014914B-BB33-43BF-9AB6-9F1AB0342735@crystalmaker.com">
<div><br>
</div>
<div>In the meantime, we have used our automated tools to analyse
all current structures files. I am attaching a summary, listing
file IDs and errors for 2,997 out of your 0.5M or so files: a
relatively-small figure (ca. 0.6%). However, these files are
invalid, and cannot be used for structural work, so I would
recommend getting them fixed.</div>
<div><br>
</div>
</blockquote>
<p>Thanks for providing the list of the files that failed
processing; we will have a close look at them.</p>
<p>As a note, the term "valid" has a quite specific meaning in the
CIF framework: it means that the structure CIFs are valid
according to some declared CIF dictionaries. The invalid files may
or may not be suitable for structural work, and may or may not be
amenable to correction.</p>
<p>Currently, three levels of checks are performed in the COD, with
the following guarantees:</p>
<p>- a syntax check. We guarantee that the CIFs from the COD
conform to the syntax declared by the IUCr, as checked with our CIF
parser [1] and other parsers in the field. This ensures that the
COD files can be processed in an automated way. Thus, if you spot
a syntactically wrong file, please report it and we will fix it
immediately; the file has to be checked against the IUCr CIF
grammar.<br>
</p>
<p>- a dictionary check. Files that validate against the IUCr
dictionaries use the data elements in the intended way.
Though many files in the COD are indeed valid in this sense, a
substantial portion of them raise one or more validation
issues (we compiled over 11 million validation messages from the
current COD collection). We look into them and search for
systematic ways to correct the most serious ones, but this is
ongoing work and full validity cannot practically be
achieved at the moment;</p>
<p>- certain COD-specific checks (e.g. checking that all three
coordinate data items, _atom_site_fract_{x,y,z}, are present). These
are supposed to catch the most obvious mistakes in the data files, but
can only be used for improving the COD records if we get hold of
the correct original data.</p>
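<p>As an illustration only (this is not the COD's own checking code), the
third kind of check could be sketched in Python as follows. The function name
and the idea of passing in a plain list of data names are assumptions of the
sketch.</p>

```python
REQUIRED_COORDINATE_TAGS = (
    "_atom_site_fract_x",
    "_atom_site_fract_y",
    "_atom_site_fract_z",
)


def missing_coordinate_tags(tags):
    """Return the required coordinate data names absent from `tags`.

    CIF data names are case-insensitive, so the comparison is done
    in lower case.
    """
    present = {tag.lower() for tag in tags}
    return [tag for tag in REQUIRED_COORDINATE_TAGS if tag not in present]
```

<p>An empty result means that all three coordinate data items are declared;
it says nothing about whether their values are known.</p>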
<p>Before we go into more detail about the issues you report, let
me draw your attention to one feature of the CIF framework that
will be important:</p>
<p>CIF files MAY (as in RFC 2119) contain the special values '?' and
'.' (without the quotes) as values for any data item in the file.
Files that contain such values are both syntactically correct
and valid in the sense defined above (i.e. such values validate
against CIF dictionaries). The '?' value, as we understand it,
denotes that the actual value of the data item is not known (but
may become known in the future). The value '.' denotes that the
value is not relevant, or does not exist at all. We sometimes use
these values to indicate special situations in the COD files; they
can also be used as atom coordinate values. Any CIF-compliant
software should be prepared to deal with such values.<br>
</p>
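<p>A minimal Python sketch of how a reader might handle these special values
when converting a coordinate to a number. It assumes the CIF parser hands over
the raw, unquoted value as a string; the function name is hypothetical, and
the sketch returns None for both '?' and '.', so a caller that needs to tell
"unknown" from "not applicable" would have to keep the raw value as well.</p>

```python
import re

# A CIF number may carry a standard uncertainty in parentheses, e.g. 0.2345(6).
_NUMBER_WITH_SU = re.compile(
    r"^([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)(?:\(\d+\))?$"
)


def parse_cif_number(raw):
    """Convert an unquoted CIF value to float, or None for '?' and '.'."""
    if raw in ("?", "."):
        return None
    match = _NUMBER_WITH_SU.match(raw)
    if match is None:
        raise ValueError(f"not a CIF number: {raw!r}")
    # The standard uncertainty, if present, is dropped in this sketch.
    return float(match.group(1))
```

<p>A site whose coordinates come back as None can then be skipped or flagged
instead of causing the whole file to be rejected.</p>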
<blockquote type="cite"
cite="mid:E014914B-BB33-43BF-9AB6-9F1AB0342735@crystalmaker.com">
<div>The most common errors are:</div>
<div><br>
</div>
<div>- missing fractional coordinates</div>
</blockquote>
<p>There are several situations in which coordinate values are missing;
let me illustrate them with entries from the list that you have provided:</p>
<p>- 2217080: this entry contains '.' as atom coordinates for a
serious reason: the structure that was published in a
peer-reviewed article turned out to be fake and was retracted. To
avoid erroneous calculations, the original coordinate values were
replaced by '.', marking them as irrelevant, and the entry is
marked as retracted. It is retained in the COD database as a
historical record and to prevent its renewed deposition. The exact
reasons for the retraction are documented in the COD CIF file, and
references to the relevant IUCr editorials are given.<br>
</p>
<p>You may want to filter out retracted entries, either by checking
the '_cod_entry_issue_severity' data item or by querying status in
our SQL database:</p>
<p>
<blockquote type="cite"><font face="monospace">mysql -u cod_reader
-h sql.crystallography.net cod -e 'select file from data where
status not like "%retracted%" or status is NULL'</font></blockquote>
</p>
<p>There are more flags that you may want to filter out (suboptimal
structures, duplicates, structures without coordinates, structures
with warnings, etc.); please check our Wiki from the COD Web page
for full documentation.<br>
</p>
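<p>To illustrate why the query above needs the "or status is NULL" part, here
is a small self-contained Python sketch that applies the same WHERE clause to
an in-memory SQLite table. The rows and status strings are invented for the
example, and SQLite stands in for MySQL only so that the sketch runs without
network access; the real COD table has many more columns.</p>

```python
import sqlite3

# Hypothetical rows standing in for the COD `data` table; only the two
# columns used in the query above are modelled.
SAMPLE_ROWS = [
    (1000001, None),
    (1000002, "retracted"),
    (1000003, "warnings"),
    (1000004, "errors,retracted"),
]


def non_retracted_files(rows):
    """Apply the same WHERE clause as the mysql command to in-memory rows."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute("CREATE TABLE data (file INTEGER, status TEXT)")
        connection.executemany("INSERT INTO data VALUES (?, ?)", rows)
        cursor = connection.execute(
            "SELECT file FROM data "
            "WHERE status NOT LIKE '%retracted%' OR status IS NULL "
            "ORDER BY file"
        )
        return [file_id for (file_id,) in cursor]
    finally:
        connection.close()
```

<p>In SQL, comparing NULL with NOT LIKE yields NULL rather than true, so
without the explicit NULL test the entries that have no status at all would be
dropped from the result.</p>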
<p>- 1000195: this entry contains '?' as coordinates, indicating
that they are unknown. Looking at the publication year (1962), I
realise that this is a very old publication; we do not have the
paper at hand, and it is also likely that at that time coordinates were
not reported for some compounds, only cell parameters.</p>
<p>COD entries of this kind are provided to indicate that the
publication exists, and to provide the information currently
known (cell parameters, chemical composition, crystal symmetry).
This information is already enough for some kinds of computations
(e.g. as initial approximations for DFT).</p>
<p>If we ever get the original publication and the coordinates are
published there, we will insert them in a new revision of this
entry. If you have access to the original publication, we would be
grateful if you could share it (or the updated CIF ;) with us.</p>
<p>- 5900030: in this entry, the x coordinate has values '.' since
these values were not determined in the original publication;
while physically the x coordinate is defined for the structure, it
is not available from this particular publication (i.e. we have no
chance to recover it from published data). Other data values, such
as cell constants and the y-z coordinates of the projection are
available and can be used.<br>
</p>
<blockquote type="cite"
cite="mid:E014914B-BB33-43BF-9AB6-9F1AB0342735@crystalmaker.com">
<div>- ambiguous site labelling</div>
</blockquote>
<p>I am not quite sure which problem you mean here. One known issue
is that some structures do have duplicate atom labels. This is an
error, and we will fix it in time. This involves a fair amount
of manual checking, however, so I cannot promise that we will do it quickly.</p>
<p>For the moment, a possible workaround would be to add a unique
suffix to such atom labels during structure interpretation and
then process the structure as usual.<br>
</p>
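<p>One possible way to implement this workaround, as a hedged Python sketch:
the function name and the "_1", "_2" suffix style are choices made for the
example, not a COD convention.</p>

```python
from collections import Counter


def make_labels_unique(labels):
    """Append a numeric suffix to every label that occurs more than once.

    Labels that are already unique are returned unchanged.
    """
    totals = Counter(labels)
    seen = Counter()
    taken = set(labels)
    result = []
    for label in labels:
        if totals[label] == 1:
            result.append(label)
            continue
        # Find a suffix that does not collide with any existing label.
        while True:
            seen[label] += 1
            candidate = f"{label}_{seen[label]}"
            if candidate not in taken:
                break
        taken.add(candidate)
        result.append(candidate)
    return result
```

<p>Note that once a duplicated label has been renamed, other loops that refer
to the original label can no longer be matched to a single site
unambiguously, so this only helps for the coordinate loop itself.</p>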
<blockquote type="cite"
cite="mid:E014914B-BB33-43BF-9AB6-9F1AB0342735@crystalmaker.com">
<div>- invalid element symbols</div>
</blockquote>
<p>This is a known issue, especially with atoms from AMCSD entries that have
a custom labelling scheme.</p>
<p>Fortunately, the new version of AMCSD has a new, consistent atom
naming, and we could assign atom types semi-automatically for
these entries. Incidentally, I have just finished the analysis and
assignment of atom types for those entries.<br>
</p>
<p>Please check out the COD revision 285101 – it should have most of
the atoms with the correct types assigned. As per my checks, only
45 COD entries remain that still have unrecognised atom types (if
you take _atom_site_type_label into account, of course). Some of
these are indeed unknown atoms, such as metal sites with uncertain
identity.</p>
<p>Please let us know how this revision scores with your software!<br>
</p>
<blockquote type="cite"
cite="mid:E014914B-BB33-43BF-9AB6-9F1AB0342735@crystalmaker.com">
<div><br>
</div>
<div>A common issue is a mismatch between site labels in different
data blocks (e.g., a table of anisotropic displacement
parameters and a table of fractional coordinates). </div>
</blockquote>
Just a bit of nit-picking on terminology – all COD files contain
just one data block (it starts with a unique data_... header in each
CIF). ADPs and coordinates are usually located in different <i>loops</i>
in the same data block.<br>
<blockquote type="cite"
cite="mid:E014914B-BB33-43BF-9AB6-9F1AB0342735@crystalmaker.com">
<div>We found these errors in numerous files submitted via the <b>American
Mineralogist crystal structures database</b> (clearly,
substantial amounts of U.S. governmental funding failed to
prevent basic transcription errors!)</div>
</blockquote>
<p>In all fairness, I would say that Bob Downs and his team do a
good job collecting all minerals; without the AMCSD contribution, our
COD collection of minerals would have been much poorer. They are
constantly improving their collection (I am regularly in touch
with Bob on these matters), and their recent work enabled us to
assign atom types with reasonable effort. As for the funding,
I'm not sure if they get substantial amounts of it; I am aware of
several startup grants they had, and I think they used them as
well as they could.</p>
<p>This does not mean that matters cannot be improved :), and
we are working on that as well. The discrepancy between the labels in
the Uij and xyz loops is a known issue that appeared in the recent
update. We are working with Bob to rectify this, but it will
take some time. In the meantime, I suggest a workaround
below:<br>
</p>
<blockquote type="cite"
cite="mid:E014914B-BB33-43BF-9AB6-9F1AB0342735@crystalmaker.com">
<div><br>
</div>
<div>Take the following file, 9003355, as an example:-</div>
<div><br>
</div>
<div>• Sites SiT1’, AlT1’ (etc.) are listed in the loop containing
Uij</div>
<div>• The same site are labelling differently (e.g., SiT1*,
AlT1*, etc.) in the loop containing xyz</div>
<div><br>
</div>
<div>Whilst, to a human, one could make inferences as to how these
labels should be related, a computer cannot make such a
judgement, thereby rendering these files useless.</div>
</blockquote>
<p>I agree that humans can match the labels, and potentially fix
them; however, we do not have the manpower to go through these lists
manually, and even then manual editing would be error-prone.
We could apply a heuristic that an apostrophe ("'") in one loop
corresponds to an asterisk ("*") in the other loop and make an
automatic correction, but the results would still need to be checked
manually (I am reluctant to commit to the COD changes that are
based on broad guesses); also, there are some other patterns in
place (e.g. the 'OH' vs 'O-H' change in labels).</p>
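<p>For readers who want to experiment with such a heuristic on their own side,
here is a rough Python sketch. Both normalisation rules are guesses based only
on the two patterns named above, the function names are hypothetical, and the
output would still need manual checking.</p>

```python
def normalise_label(label):
    """Map a site label to a comparison key under two assumed rules.

    Both rules are guesses taken from the patterns named above:
    an apostrophe and an asterisk are treated as the same marker,
    and hyphens are ignored (so 'O-H' and 'OH' compare equal).
    """
    return label.replace("*", "'").replace("-", "")


def match_aniso_labels(aniso_labels, site_labels):
    """Pair each Uij-loop label with a coordinate-loop label, or None.

    A pairing is made only when exactly one coordinate-loop label has
    the same normalised key, so ambiguous cases are left unmatched.
    """
    by_key = {}
    for site_label in site_labels:
        by_key.setdefault(normalise_label(site_label), []).append(site_label)
    pairs = {}
    for aniso_label in aniso_labels:
        candidates = by_key.get(normalise_label(aniso_label), [])
        pairs[aniso_label] = candidates[0] if len(candidates) == 1 else None
    return pairs
```

<p>Leaving ambiguous or unmatched labels as None, rather than picking the
closest candidate, keeps the guesswork visible to whoever reviews the
result.</p>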
<p>From the error messages in the log file that you sent us, I have
the impression that your program looks for an atom label in the
_atom_site_aniso_label (aka Uij) loop, and then tries to find the
corresponding _atom_site_label in the coordinate loop. This will
fail not only when the labels do not match but also when the atom
is not mentioned in the _atom_site_aniso_label loop <i>at all</i>.
Since not all atoms are refined anisotropically, some of them can
legitimately be left out of the Uij loop while still being present in the
_atom_site_fract_x loop; such files are perfectly valid and
usable.</p>
<p>May I suggest a workaround for the processing of such files:
first look in the coordinate loop for the
_atom_site_label to identify all atoms, and then look up the
anisotropic displacement parameters Uij in the
_atom_site_aniso_label loop if they exist. If they do not, it is
often possible to use Uiso instead, and I bet this will be a fair
approximation even for anisotropically refined atoms. In this way
you will correctly process all correct files and have reasonable
approximate data for the files that are currently mislabelled. In
the future we will correct the Uij&lt;-&gt;xyz label
correspondence (our validator detects the mismatches), and you can then
recalculate your outputs with the new COD revision, getting more
accurate results. I can let you know when such a revision is issued
in the COD, but please ping me after some time since I may forget
:)<br>
</p>
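<p>The lookup order suggested above might be sketched in Python as follows.
The shapes of the two inputs (a list of site dictionaries and a mapping from
Uij-loop labels to six values) and the function name are assumptions made for
the example, not a format defined by the COD.</p>

```python
def collect_displacements(sites, aniso):
    """Walk the coordinate loop first, then attach displacement data.

    `sites` is a list of dicts with keys 'label' and 'u_iso' (float or
    None), read from the coordinate loop; `aniso` maps a label from the
    Uij loop to its six Uij values.
    """
    result = {}
    for site in sites:
        label = site["label"]
        if label in aniso:
            # Anisotropic parameters exist for this atom: use them.
            result[label] = ("Uani", aniso[label])
        elif site.get("u_iso") is not None:
            # No matching Uij entry: fall back to the isotropic value.
            result[label] = ("Uiso", site["u_iso"])
        else:
            result[label] = ("none", None)
    return result
```

<p>Because the loop is driven by the coordinate loop, an atom that is absent
from the Uij loop, or whose Uij label does not match, is still kept and simply
receives the isotropic value when one is available.</p>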
<p>Of course, one can also apply the heuristic mentioned above, or
skip entries with such mismatches altogether, until the new COD
revision is in place.</p>
<p>I hope this clarifies the COD data contents and the way we address
the detected problems.</p>
<p>Once more, thank you for your report!<br>
</p>
<blockquote type="cite"
cite="mid:E014914B-BB33-43BF-9AB6-9F1AB0342735@crystalmaker.com">
<div><br>
</div>
<div>I hope this helps, and do let me know if you have any
questions.</div>
<div><br>
</div>
<div>With best wishes,</div>
<div>Yours faithfully,</div>
<div><br>
</div>
<div>David Palmer</div>
<div><br>
</div>
<div>
<div style="font-family: MinionPro-Regular;">David C Palmer,
Ph.D. (Cantab), M.A. (Cantab),</div>
<div style="font-family: MinionPro-Regular;">Managing Director,
CrystalMaker Software Ltd</div>
<div style="font-family: MinionPro-Regular;">Centre for
Innovation &amp; Enterprise | Oxford University Begbroke
Science Park</div>
<div style="font-family: MinionPro-Regular;">Woodstock Road,
Begbroke, Oxfordshire, OX5 1PF, UK</div>
</div>
<br>
</blockquote>
<p>Sincerely yours,<br>
Saulius<br>
</p>
<p>References:<br>
</p>
<p>[1] Merkys, A.; Vaitkus, A.; Butkus, J.; Okulič-Kazarinas, M.; Kairys, V.
&amp; Gražulis, S.<br>
<i>COD::CIF::Parser</i>: an error-correcting CIF parser for the Perl language.<br>
<em>Journal of Applied Crystallography,</em> <b>2016</b><i>, 49</i>,
292-301, DOI: <a class="moz-txt-link-freetext" href="https://doi.org/10.1107/S1600576715022396">https://doi.org/10.1107/S1600576715022396</a></p>
<pre class="moz-signature" cols="72">--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366
</pre>
</body>
</html>