<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Dear Steef,</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">many thanks for your report on the
issues with the COD data! Your feedback is very valuable to us. I
have fixed some of the problems (the file 7/70/81/7708164.cif should
now be OK); regarding the others, I give my answers below.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 2023-02-01 00:39, Steef Boerrigter
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">I am currently developing a program in the programming language of D
to read .cif files and process the contents to calculate various
things. I am sure I am just one of hundreds to have taken the
frustrating decision to try and write a comprehensive parser of "STAR"
formatted files.</pre>
</blockquote>
As a side note: if writing a CIF parser from scratch feels
frustrating, you may want to have a look at our CIF parser from
cod-tools [1,2]; it may be easier to link it with your program
than to write a completely new one. Although the paper focuses on
the Perl implementation, there is a core parser ('cifparse') which
is in plain C, with Perl and Python bindings. It is rather portable:
one of my students recently linked it with a multi-tasking Ada
program :); it should not be that difficult to link it with D
either. The parser can also correct some common
mistakes in CIF syntax, such as missing closing quotes.<br>
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">During testing of my implementation, I came across two files that
clearly are corrupted. I deleted them on my mirror, re-synced and
received the exact same corrupted files.</pre>
</blockquote>
Which protocol did you use for synchronisation? In this case
it would probably have helped to check out the file from the
Subversion repository (svn://crystallography.net/cod). Sure enough,
SVN is not infallible either, but it is a distribution route different
from 'rsync' and 'http(s)', so it may be useful to have as a backup.
You can also 'svnsync' the whole repo to have a local read-only
copy.<br>
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap=""> So, I am pretty sure the bitrot is on the COD server.
The files are
7/70/81/7708164.cif which has zero bytes.</pre>
</blockquote>
<p>This file was indeed damaged; many thanks for spotting it!</p>
<p>I have restored the file from the repository, and now both
'rsync' and 'http(s)' protocols should yield correct data. Please
have a look. The repository seems intact. I'm now comparing
checksums for the remaining files, to see if there are more
corrupt ones on the server. The 'bit rot' probably happened when
we had an HDD failure some time ago.
</p>
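<p>Such a checksum comparison could look roughly like the Python sketch
below. It is only an illustration, not the procedure actually run on the
COD server: it assumes two local directory trees (for example an SVN
working copy as the reference and an rsync mirror), and the function
names <code>sha256_of</code> and <code>find_damaged</code> are
hypothetical.</p>
<pre>
```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_damaged(reference_root: Path, mirror_root: Path) -> list[str]:
    """Report *.cif files that are missing, empty, or different in the mirror.

    Both arguments are assumed to be local directories with the same layout.
    """
    damaged = []
    for ref_file in sorted(reference_root.rglob("*.cif")):
        relative = ref_file.relative_to(reference_root)
        mirror_file = mirror_root / relative
        if not mirror_file.is_file():
            damaged.append(f"{relative.as_posix()}: missing")
        elif mirror_file.stat().st_size == 0:
            damaged.append(f"{relative.as_posix()}: zero bytes")
        elif sha256_of(mirror_file) != sha256_of(ref_file):
            damaged.append(f"{relative.as_posix()}: checksum mismatch")
    return damaged
```
</pre>
<p>A zero-byte file such as the one reported would be listed as
"zero bytes" without computing any checksum.</p>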
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">7/05/48/7054812.cif which goes into corruption at line 55186.</pre>
</blockquote>
<p>This file is a different story. The file itself is in fact
syntactically correct, is served exactly as in the repository, and most of
the data are intact. However, you are absolutely right, the
reflection list in the file is garbled towards its end.
Since the list itself is stored in a text field, a (correct) CIF parser
accepts the file. But the reflection list cannot be used as it
is.</p>
<p>The problem comes from the original supplementary data of the
article [3], where the same corruption appears at line 66863. COD just
reproduces this situation.<br>
</p>
<p>I have written an e-mail to the authors of the original
publication. If they still have the original file and are willing to
share it with us, we will update the corresponding COD entry with
the correct HKL Fobs list. If they do not answer or do not have
the file, I think we will probably have to curate the data by
truncating the reflection list at the reflection "15 -3 5
-7.40 8.00 166 0.27655 ...", and adding a corresponding
warning to the CIF. The truncated reflection list, even though
incomplete, should still be usable (e.g. one can still compute R
factors, re-refine the structure, etc.).</p>
<p>Please watch the updates (new revisions) of this file.<br>
</p>
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">During testing, I further came across several hundred files that have rather questionable formatting choices that I would argue are either in violation with the CIF specification</pre>
</blockquote>
Well, most probably they are not in violation :). We went rather
carefully through the syntax definitions of CIF and the Tables, and
the discrepancies were analysed and fixed. The remaining syntax
(unless we overlooked something very nasty :) ) should satisfy the
specification of the CIF.<br>
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap=""> or stretch the rules to the extent that it makes it almost impossible for any implementation to interpret the data correctly.</pre>
</blockquote>
<p>I would say there are a lot of implementations, including our
own, that parse most of the data correctly, including all symmetry
operators (this is what we use in our calculations).<br>
</p>
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">To what extent are the maintainers interested in learning about my findings and potentially amending the entries to fix them?</pre>
</blockquote>
We are certainly interested in hearing your ideas, and will fix things
wherever possible. We can, however, only take suggestions that have
an absolutely firm mandate in the CIF standard.
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">Just to name one example. Apparently the program Maud produces the
spacegroup operators in the format (see 3/50/01/3500127.cif)
1 '-x+0.25, -y+0.25, -z+0.25'
as opposed to
1 '-x+1/4, -y+1/4, -z+1/4'
To my knowledge, none of the IUCR CIF guidelines, specs, website,
international tables ever use the decimal format for the translations.</pre>
</blockquote>
<p>Regarding decimal fractions: I have just looked again
through my copy of the Tables and the CIF dictionaries. It is
true that they never use decimal points in examples. But I also did
not find any place where they <i>forbid</i> the use of real
numbers in the way Maud does. What is not explicitly forbidden is
allowed.<br>
</p>
<p>The ITC vols. A and B talk about "real numbers" wherever
symmetry operator or matrix notation is involved [4], e.g.:</p>
<p>
<blockquote type="cite">The change-of-basis operator V has the
general form ($v_x$, $v_y$, $v_z$).<br>
The vectors $v_x$, $v_y$ and $v_z$ are specified by<br>
<img src="cid:part1.Skd2TRKX.psoUXxeS@ibt.lt" alt=""><br>
where $r_{i,j}$ and $t_{i}$ are <i>fractions</i> or <i>real
numbers</i> (emphasis mine).<br>
</blockquote>
</p>
<p>As we see, the numbers are supposed to be <i>real numbers</i>,
and they are explicitly mentioned as distinct from <i>fractions</i>.
Thus, although all examples in the ITC indeed use vulgar fractions
for crystallographic translations, decimal fractions (a.k.a. <i>real
numbers</i>, or approximations thereof) seem to be assumed to be
permissible.</p>
<p>At this point I get the impression that neither CIF nor the Tables
are concerned with standardising computer-readable
representations; they just give mathematical definitions (<i>real
numbers</i>) and illustrate the notions with examples in the text. <br>
</p>
<p>Further, the CIF data item definitions say [5]:</p>
<p>
<blockquote type="cite">
<pre> _item.name '_space_group_symop.operation_xyz'
# ...
_item_examples.detail
'x,1/2-y,1/2+z' 'c glide reflection through the plane (x,1/4,z)'
_item_description.description
; A <b>parsable string giving one of the symmetry operations</b> of the
space group in algebraic form.</pre>
</blockquote>
</p>
<p>No grammar for '_space_group_symop.operation_xyz' or related
fields is given.<br>
</p>
<p>I interpret these texts in the following way: all unambiguously <i>parsable</i>
symop descriptions should be accepted, <i>provided they make
crystallographic sense.</i> An interpreter should accept as
broad a range of syntaxes as possible. Of course, a writer should emit
as narrow a range as possible, but that rule applies to a
single program and cannot be imposed on a collective database
like COD.</p>
<p>The operator '-x+0.25, -y+0.25, -z+0.25' is clearly parsable,
clearly unambiguous, and clearly crystallographically correct. I
therefore see no reason (formal or otherwise) to reject it.</p>
<p>Thus, in the COD, we do not convert decimal fractions in the
symmetry operators ('0.50') to vulgar fractions ('1/2') if decimals
were present in the original file. Clients are expected to
parse both notations (we did do the conversion for coordinates,
though; some people specified the atom coordinate 'y' as '1/4' or even
as 'x' – guess what <i>that</i> means... ;)<br>
</p>
<p>My suggestion (and our currently implemented symop parser
behaviour) is to treat symops in the following way:</p>
<p>1. accept all possible translation notations: 'x+7/6', '1/6+x',
'x-5/6' (all of these are the same as 'x+1/6', and it is not clear why one should
be preferred over another!), 'x+0.166667';</p>
<p>2. reconstruct all Seitz matrices from these notations;</p>
<p>3. reduce all translations "modulo 1" (i.e. '7/6' → '1.16667' →
'0.16667');</p>
<p>4. snap all crystallographic translations to the nearest
crystallographic value of your choice (i.e. '0.16667' → 1/6);</p>
<p>5. use rational arithmetic if your platform supports it;</p>
<p>6. check whether your symops are crystallographic and whether
they form a group (all symops that are necessary to reconstruct
the unit cell MUST be specified, as per CIF dictionaries).<br>
</p>
<p>This works, in my hands, for 100% of the COD symops and 99% of
the symops out there in the wild.<br>
</p>
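<p>The steps above can be sketched in Python as follows. This is only
an illustration under stated assumptions, not the cod-tools
implementation: the names <code>parse_symop</code>, <code>compose</code>
and <code>is_closed</code> are hypothetical; only terms of the form ±x,
±y, ±z and plain numbers (vulgar or decimal fractions, no exponents) are
handled; snapping of rounded decimals (step 4) is left out; and the group
test checks closure only, not whether the rotation parts are
crystallographic.</p>
<pre>
```python
import re
from fractions import Fraction
from itertools import product

AXES = "xyz"
# One term: optional sign, then an axis letter or a number ('1/4', '0.25', '1').
_TERM = re.compile(r"([+-]?)(?:([xyz])|(\d+/\d+|\d*\.\d+|\d+))")


def parse_symop(text):
    """Parse e.g. '-x+1/4, y+0.5, -z' into (rotation, translation).

    rotation is a 3x3 tuple of ints; translation is a tuple of three
    Fractions, already reduced modulo 1.
    """
    components = "".join(text.lower().split()).split(",")
    if len(components) != 3 or "" in components:
        raise ValueError(f"expected three non-empty components in {text!r}")
    rotation, translation = [], []
    for component in components:
        row, shift, pos = [0, 0, 0], Fraction(0), 0
        while pos != len(component):
            match = _TERM.match(component, pos)
            # Every term after the first must carry an explicit sign.
            if match is None or (pos and not match.group(1)):
                raise ValueError(f"cannot parse {component!r} in {text!r}")
            sign = -1 if match.group(1) == "-" else 1
            if match.group(2):
                row[AXES.index(match.group(2))] += sign
            else:
                shift += sign * Fraction(match.group(3))
            pos = match.end()
        rotation.append(tuple(row))
        translation.append(shift % 1)  # translations are defined modulo 1
    return tuple(rotation), tuple(translation)


def compose(a, b):
    """Return the Seitz product a*b (apply b first, then a), modulo 1."""
    (ra, ta), (rb, tb) = a, b
    rotation = tuple(
        tuple(sum(ra[i][k] * rb[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )
    translation = tuple(
        (sum(ra[i][k] * tb[k] for k in range(3)) + ta[i]) % 1 for i in range(3)
    )
    return rotation, translation


def is_closed(symops):
    """True if the product of every pair of operators is again in the set."""
    parsed = {parse_symop(s) for s in symops}
    return all(compose(a, b) in parsed for a, b in product(parsed, repeat=2))


# Mixed vulgar and decimal notation for the same P 21/c operators.
P21C = ["x,y,z", "-x,y+1/2,-z+1/2", "-x,-y,-z", "x,-y+0.5,z+0.500"]
print(is_closed(P21C))  # prints True
```
</pre>
<p>Because the translations are held as exact rationals, '1/2', '0.5'
and '0.500' compare equal, and 'x+7/6', '1/6+x' and 'x-5/6' all parse to
the same operator.</p>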
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">It is bad enough to have to program an exception to the standard fractional notation, but what happens with the 1/3 translation. </pre>
</blockquote>
Snap to the nearest crystallographic translation: 0.33333 → Rational
(1,3);<br>
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">How many decimals should that get in this format.</pre>
</blockquote>
Standard IEEE 754 single precision float (at least 6 decimal digits)
is more than enough. In fact, even the single digit '0.3' is closer to 1/3
than to 1/4 or 1/2; so if you "snap to crystallographic values", it should
work with any precision.<br>
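<p>A minimal sketch of such snapping, in Python, might look like this.
The list of allowed values (multiples of 1/6 and 1/4) and the name
<code>snap</code> are assumptions made for the illustration; a real
implementation would probably also reject values that lie too far from
every candidate.</p>
<pre>
```python
from fractions import Fraction

# Assumed set of allowed translation components: 0, 1/6, 1/4, 1/3, 1/2,
# 2/3, 3/4, 5/6. The value 1 is listed as well, so that inputs just below 1
# (e.g. 0.99999) snap to 1 and then wrap around to 0.
CRYSTALLOGRAPHIC = [Fraction(n, 12) for n in (0, 2, 3, 4, 6, 8, 9, 10, 12)]


def snap(value):
    """Reduce a translation modulo 1 and snap it to the nearest allowed value."""
    reduced = Fraction(value) % 1
    nearest = min(CRYSTALLOGRAPHIC, key=lambda candidate: abs(candidate - reduced))
    return nearest % 1


for text in ("0.33333", "0.3", "0.166667", "0.500", "1.16667"):
    print(text, "->", snap(text))
```
</pre>
<p>The closest pair of allowed values here is 1/12 apart (1/6 and 1/4,
or 1/4 and 1/3), so any rounding error below 1/24 still snaps to the
intended value.</p>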
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">Even worse is that other entries list the translation as +0.500 </pre>
</blockquote>
<p>Why is this worse than '+0.5'? I would accept the general computer
language floating point number notation here, defined by the
extended regexp:
'[-+]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][-+]?[0-9]+)?'.<br>
</p>
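<p>As a quick check, the regexp can be tried out in Python. The helper
name <code>is_float_literal</code> is hypothetical, and
<code>fullmatch</code> is used so that the whole token has to match:</p>
<pre>
```python
import re

# The pattern quoted above; fullmatch() anchors it to the whole string.
FLOAT_RE = re.compile(r"[-+]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][-+]?[0-9]+)?")


def is_float_literal(text):
    """True if the whole string is a floating-point literal per the pattern."""
    return FLOAT_RE.fullmatch(text) is not None


for token in ("+0.500", "0.5", ".5", "5.E-01", "1", "1/2", "x"):
    print(token, is_float_literal(token))
```
</pre>
<p>Vulgar fractions such as '1/2' deliberately do not match; they need
their own rule in the symop parser.</p>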
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">and I have seen '...z+1' and z+7/6. I mean, why? </pre>
</blockquote>
<p>In the COD, we tend to leave the symmetry operators as they were
encoded by the authors of the structure, as long as we can parse
them (and we can easily parse the above mentioned constructs). The
authors might have had a good reason to write them in such a way;
and changing formatting just for the sake of changing formatting
might introduce extra errors and give us extra work for no real
gain.</p>
<p>The 'z+1' is clearly the same as 'z', since all arithmetic is
modulo 1. Just take the fractional part of all translations you
get...<br>
</p>
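<p>One detail worth getting right when taking the fractional part is
the handling of negative translations. A small Python sketch (the name
<code>fractional_part</code> is hypothetical) that subtracts the floor
rather than relying on a remainder operator:</p>
<pre>
```python
import math
from fractions import Fraction


def fractional_part(translation):
    """Return translation - floor(translation), which lies in [0, 1)."""
    value = Fraction(translation)
    return value - math.floor(value)


for text in ("1", "7/6", "-5/6", "-1/2"):
    print(text, "->", fractional_part(text))
```
</pre>
<p>In languages where the remainder takes the sign of the dividend (C's
<code>fmod</code>, for instance), a plain remainder of -5/6 stays
negative, whereas subtracting the floor gives the expected 1/6.</p>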
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">These are all non-standard translational operators that make it ...</pre>
</blockquote>
<p>To mark them as "non-standard" we first need to have a standard
to check against, and to my knowledge there is no explicit
standard so far that would specify the syntax of symmetry
operators.</p>
<p>We are working on such a standard for the OPTIMADE API [6], but so far
it is a draft, and it will not pertain to CIF, just to the OPTIMADE
APIs...<br>
</p>
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">ridiculously difficult to map the operators to a space group.</pre>
</blockquote>
<p>I would respectfully disagree. Parsing the symmetry operators
that are currently in use is rather simple; we have implemented it
with no real difficulty. I agree that it is annoying when we have
to guess what other people have in mind, and I would rather have
an explicit standard, but this is the state of the art in data
exchange so far, alas...</p>
<p>Also, we cannot impose any specific format for symops in the COD.
Not only might it introduce more errors, but it would make some
people unhappy as well, e.g. someone may complain that they need
to deal with all these vulgar fractions ('1/2') instead of just
using a standard C/Ada/whatever library to parse a "standard"
floating point number (e.g. "5.E-01"). The wishes will inevitably
become contradictory.<br>
</p>
<p>Moreover, since, as you have noted, there are programs (Maud)
that <i>do</i> use decimal floating point translations, your
parser will have to deal with them anyway, even if we changed
the convention in the COD. So I see no way around
interpreting all widespread symop encodings; otherwise you
will not be able to process some CIFs that are out there in the
wild (e.g. those from Maud...)
</p>
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">Allowing all these exceptions make things very challenging. I would argue it improves the quality of the data if these type of things are standardized.</pre>
</blockquote>
I agree that uniformity is helpful, but in the case of symops they
are, IMHO, uniform enough to enable automated processing of the
whole corpus of the COD data. There is no actual problem with it,
either when using existing libraries or when rolling your own.
<blockquote type="cite"
cite="mid:CAFe6kZtgTnK7wx--wu50L2QXpxLTru-yS15csH4O7QVdZ02zOQ@mail.gmail.com">
<pre class="moz-quote-pre" wrap="">to what extent would you be willing to receive my findings or what are the possibilities for me to suggest edits?
</pre>
</blockquote>
<p>It is very useful for us to get feedback from you, but we cannot
act on every proposal that we receive. We will fix obvious errors
and "data rot" ASAP (like 7708164.cif and 7054812.cif); but
things like symop encoding we will leave unchanged, since I very
strongly insist that processing of <i>all</i> symop variants MUST
(as in RFC 2119) be implemented in every correct CIF library.</p>
<p>Sorry for a long e-mail... hope it will be somewhat helpful.</p>
<p>Sincerely,<br>
Saulius<br>
</p>
<p>Refs.:</p>
<p><a name="Merkys2016"></a>[1] Merkys, A.; Vaitkus, A.; Butkus, J.; Okulič-Kazarinas, M.;
Kairys, V. &amp; Gražulis, S. <i>COD::CIF::Parser</i>:
an error-correcting CIF parser for the Perl language.
<i>Journal of Applied Crystallography</i>, 2016, <i>49</i>,
292-301, DOI: <a class="moz-txt-link-freetext" href="https://doi.org/10.1107/S1600576715022396">https://doi.org/10.1107/S1600576715022396</a></p>
<p>[2] Merkys, A. et al. The 'cod-tools' package. URL:
<a class="moz-txt-link-freetext" href="https://github.com/cod-developers/cod-tools">https://github.com/cod-developers/cod-tools</a> [accessed
2023-02-01T15:06+02:00]</p>
<p>[3] Article "Activation of carbon dioxide by new mixed sandwich
uranium complexes ...", DOI: <a class="moz-txt-link-freetext" href="https://doi.org/10.1039/c5nj00590f">https://doi.org/10.1039/c5nj00590f</a> –
Supplementary files, Crystal structure data. URL:
<a class="moz-txt-link-freetext" href="https://www.rsc.org/suppdata/c5/nj/c5nj00590f/c5nj00590f2.cif">https://www.rsc.org/suppdata/c5/nj/c5nj00590f/c5nj00590f2.cif</a>
[accessed 2023-02-01T14:40+02:00]</p>
<p>[4] IUCr. <i>International Tables for Crystallography</i>
(2006). Vol. B, Chapter 1.4, pp. 99–161, "Symmetry in reciprocal
space".</p>
<p>[5] IUCr. Symmetry dictionary (symCIF), v1.0.1 (2005). URL:
<a class="moz-txt-link-freetext" href="https://www.iucr.org/__data/iucr/cif/dictionaries/cif_sym.dic">https://www.iucr.org/__data/iucr/cif/dictionaries/cif_sym.dic</a>
[accessed 2023-02-01T17:28+02:00]</p>
<p>[6] OPTIMADE issue #416: Insufficient space group descriptions.
URL: <a class="moz-txt-link-freetext" href="https://github.com/Materials-Consortia/OPTIMADE/issues/416">https://github.com/Materials-Consortia/OPTIMADE/issues/416</a>
[accessed 2023-02-01T19:22+02:00].<br>
<br>
</p>
<pre class="moz-signature" cols="72">--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366
</pre>
</body>
</html>