[Cod-bugs] Corrupted files in COD
Saulius Gražulis
grazulis at ibt.lt
Wed Feb 1 19:59:12 EET 2023
Dear Steef,
many thanks for your report on the issues with the COD data! Your
feedback is very valuable for us. I have fixed some of the problems (the
file 7/70/81/7708164.cif should now be OK); regarding others, I give my
answers below.
On 2023-02-01 00:39, Steef Boerrigter wrote:
> I am currently developing a program in the programming language of D
> to read .cif files and process the contents to calculate various
> things. I am sure I am just one of hundreds to have taken the
> frustrating decision to try and write a comprehensive parser of "STAR"
> formatted files.
As a side note: if writing a CIF parser de novo feels frustrating,
you may want to have a look at our CIF parser from cod-tools [1,2]; maybe
it will be easier to link it with your program than to write a
completely new one. Although the paper focuses on the Perl
implementation, there is a core parser ('cifparse') which is in plain C,
with Perl and Python bindings. It is rather portable – one of my
students recently linked it with a multi-tasking Ada program :); it
should not be that difficult to link it with D either. The parser also
has the capability to correct some common mistakes in CIF syntax, such as
missing closing quotes.
> During testing of my implementation, I came across two files that
> clearly are corrupted. I deleted them on my mirror, re-synced and
> received the exact same corrupted files.
Which protocol did you use for synchronisation? In the latter case, it
would probably have helped to check out the file from the Subversion
repository (svn://crystallography.net/cod). Sure enough, SVN is also not
infallible, but it is a distribution route different from 'rsync' and
'http(s)', so it may be useful to have such a backup. You can also
'svnsync' the whole repo to have a local read-only copy.
> So, I am pretty sure the bitrot is on the COD server.
>
> The files are
> 7/70/81/7708164.cif which has zero bytes.
This file was indeed damaged; many thanks for spotting it!
I have restored the file from the repository, and now both 'rsync' and
'http(s)' protocols should yield correct data. Please have a look. The
repository seems intact. I'm now comparing checksums for the remaining
files, to see if there are more corrupt ones on the server. The 'bit
rot' probably happened when we had an HDD failure some time ago.
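
A checksum comparison of this kind can be sketched roughly as follows. This is only an illustration, not the script we actually run on the server: the function names and the choice of SHA-256 are assumptions, and both directory trees are supplied by the caller (e.g. an rsync mirror and an SVN working copy).

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_suspect_files(mirror_root, reference_root, pattern="*.cif"):
    """Yield (relative_path, reason) for mirror files that differ from the reference."""
    mirror_root, reference_root = Path(mirror_root), Path(reference_root)
    for reference in sorted(reference_root.rglob(pattern)):
        relative = reference.relative_to(reference_root)
        mirrored = mirror_root / relative
        if not mirrored.is_file():
            yield relative, "missing"
        elif mirrored.stat().st_size == 0:
            yield relative, "zero bytes"
        elif sha256_of(mirrored) != sha256_of(reference):
            yield relative, "checksum mismatch"
```

Calling 'list(find_suspect_files(mirror, working_copy))' then lists every file worth re-fetching; a zero-byte file such as the one reported above is flagged without hashing it.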
> 7/05/48/7054812.cif which goes into corruption at line 55186.
This file is a different story. The file itself is in fact syntactically
correct, served as in the repository, and most of the data are intact.
However, you are absolutely right, the reflection list from the file is
garbled at the end of the list. Since the list itself is in a text
field, a (correct) CIF parser accepts the file. But the reflection list
cannot be used as it is.
The problem comes from the original supplementary data of the article
[3]; the same corruption is present there, at line 66863. COD just
reproduces this situation.
I have written an e-mail to the authors of the original publication. If
they still have an original file and are ready to share it with us, we
will update the corresponding COD entry with the correct HKL Fobs list.
If they do not answer or do not have the file, I think we will probably
have to curate the data by truncating the reflection list at the reflection
"15 -3 5 -7.40 8.00 166 0.27655 ...", and posting a
corresponding warning in the CIF. The truncated reflection list, even
though incomplete, should still be usable (e.g. one can still compute R
factors, re-refine the structure, etc.).
Please watch the updates (new revisions) of this file.
> During testing, I further came across several hundred files that have rather questionable formatting choices that I would argue are either in violation with the CIF specification
Well, most probably they are not in violation :). We went rather
carefully through the syntax definitions of CIF and the Tables, and the
discrepancies were analysed and fixed. The remaining syntax (unless we
overlooked something very nasty :) ) should satisfy the specification of
the CIF.
> or stretch the rules to the extent that it makes it almost impossible for any implementation to interpret the data correctly.
I would say there are a lot of implementations, including our own, that
parse most of the data correctly, including all symmetry operators (this
is what we use in our calculations).
> To what extent are the maintainers interested in learning about my findings and potentially amending the entries to fix them?
We are for sure interested to hear your ideas, and will fix things
wherever possible. We can, however, only take suggestions that have
an absolutely firm mandate in the CIF standard.
> Just to name one example. Apparently the program Maud produces the
> spacegroup operators in the format (see 3/50/01/3500127.cif)
> 1 '-x+0.25, -y+0.25, -z+0.25'
> as opposed to
> 1 '-x+1/4, -y+1/4, -z+1/4'
> To my knowledge, none of the IUCR CIF guidelines, specs, website,
> international tables ever use the decimal format for the translations.
Regarding decimal fractions: I have just additionally looked through my
copy of the Tables and the CIF dictionaries. That's true, they never use
decimal points in an example. But I also did not find any place where they
/forbid/ the use of real numbers in the way Maud does. What is not
explicitly forbidden is allowed.
The ITC vols. A and B talk about "real numbers" everywhere where
symmetry operator or matrix notation is involved [4], e.g.:
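> The change-of-basis operator V has the general form $(v_x, v_y, v_z)$.
> The vectors $v_x$, $v_y$ and $v_z$ are specified by
>
> $v_x = r_{1,1} X + r_{1,2} Y + r_{1,3} Z + t_{1}$
> $v_y = r_{2,1} X + r_{2,2} Y + r_{2,3} Z + t_{2}$
> $v_z = r_{3,1} X + r_{3,2} Y + r_{3,3} Z + t_{3}$
>
> where $r_{i,j}$ and $t_{i}$ are /fractions/ or /real numbers/
> (emphasis mine).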
As we see, the numbers are supposed to be /real numbers/, and they are
explicitly mentioned as distinct from /fractions/. Thus, although all
examples in the ITC indeed use vulgar fractions for crystallographic
translations, decimal fractions (a.k.a. /real numbers/, or approximations
thereof) seem to be assumed as permissible.
At this point I get the impression that neither CIF nor the Tables are
concerned with standardisation of computer-readable representations;
they just give mathematical definitions (/real numbers/) and give
examples of the notions in the text.
Further, the CIF data item definitions say [5]:
> _item.name '_space_group_symop.operation_xyz'
> # ...
> _item_examples.detail
> 'x,1/2-y,1/2+z' 'c glide reflection through the plane (x,1/4,z)'
> _item_description.description
> ; A *parsable string giving one of the symmetry operations* of the
> space group in algebraic form.
No grammar for '_space_group_symop.operation_xyz' or related fields is
given.
I interpret these texts in the following way: all unambiguously
/parsable/ symop descriptions should be accepted, /provided they have
crystallographic sense./ The interpreter should accept as broad a
range of syntaxes as possible; of course we should write as narrow a range
as possible, but the latter holds for a single program and cannot
be applied to a collective database like the COD.
The operator '-x+0.25, -y+0.25, -z+0.25' is clearly parsable, clearly
unambiguous, and clearly crystallographically correct. I therefore see
no reason (formal or otherwise) to reject it.
Thus, in the COD, we do not convert decimal fractions in the symmetry
operators ('0.50') to vulgar fractions (1/2) if decimals were present in
the original file. It is expected that clients can parse both notations
(we did the conversion for coordinates, though; some people specified
atom coordinate 'y' as '1/4' or even as 'x' – guess what /that/ means... ;)
My suggestion (and our currently implemented symop parser behaviour) is
to treat symops in the following way:
1. accept all possible translation notations: 'x+7/6', '1/6+x', 'x-5/6'
(it is the same as 'x+1/6', and it is not clear why one should be preferred
over another!), 'x+0.166667';
2. reconstruct all Seitz matrices from these notations;
3. reduce all translations "modulo 1" (i.e. '7/6' → '1.16667' → '0.16667');
4. snap all crystallographic translations to the nearest
crystallographic value of your choice (i.e. '0.16667' → 1/6);
5. use rational arithmetic if your platform supports it;
6. check whether your symops are crystallographic and whether they form
a group (all symops that are necessary to reconstruct the unit cell MUST
be specified, as per CIF dictionaries).
This works, in my hands, for 100% of the COD symops and 99% of the
symops out there in the wild.
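
Steps 1 to 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the cod-tools implementation: the names 'parse_symop', 'snap' and 'CRYSTALLOGRAPHIC' are hypothetical, the tolerance of 1/24 is my choice (half the smallest gap between allowed translations), and exponent notation and the group check of step 6 are left out.

```python
import re
from fractions import Fraction

# Allowed translations: 0, 1/6, 1/4, 1/3, 1/2, 2/3, 3/4, 5/6 (and 1, folded to 0).
CRYSTALLOGRAPHIC = [Fraction(n, 12) for n in (0, 2, 3, 4, 6, 8, 9, 10, 12)]

# One signed term: an axis letter, or a number that may be a vulgar fraction.
TERM = re.compile(r"([+-]?)(?:([xyz])|(\d+(?:\.\d+)?|\.\d+)(?:/(\d+))?)")

def snap(value, tolerance=Fraction(1, 24)):
    """Steps 3-4: reduce modulo 1, then snap to the nearest allowed translation."""
    value %= 1
    nearest = min(CRYSTALLOGRAPHIC, key=lambda c: abs(c - value))
    if abs(nearest - value) > tolerance:
        raise ValueError(f"{float(value)} is not a crystallographic translation")
    return nearest % 1  # maps the candidate 1 back to 0

def parse_symop(symop):
    """Steps 1-2: turn e.g. '-x+0.25, -y+1/4, z' into a 4x4 Seitz matrix."""
    components = symop.lower().replace(" ", "").split(",")
    if len(components) != 3 or not all(components):
        raise ValueError(f"expected three components in {symop!r}")
    seitz = []
    for component in components:
        row, shift, pos = [0, 0, 0], Fraction(0), 0
        while pos < len(component):
            term = TERM.match(component, pos)
            # Every term after the first one must carry an explicit sign.
            if term is None or (pos > 0 and not term.group(1)):
                raise ValueError(f"cannot parse {component!r} in {symop!r}")
            sign = -1 if term.group(1) == "-" else 1
            if term.group(2):
                row["xyz".index(term.group(2))] += sign
            else:
                # Step 5: exact rational arithmetic, also for decimal input.
                shift += sign * Fraction(term.group(3)) / int(term.group(4) or 1)
            pos = term.end()
        seitz.append(row + [snap(shift)])
    seitz.append([0, 0, 0, 1])
    return seitz
```

With this sketch, 'x+7/6', '1/6+x' and 'x+0.166667' all yield the same first row, [1, 0, 0, 1/6], and the Maud-style operator '-x+0.25, -y+0.25, -z+0.25' gets translations of exactly 1/4.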
> It is bad enough to have to program an exception to the standard fractional notation, but what happens with the 1/3 translation.
Snap to the nearest crystallographic translation: 0.33333 → Rational (1,3);
> How many decimals should that get in this format.
Standard IEEE 754 single precision float (at least 6 decimal digits) is
more than enough. In fact, even the single digit '0.3' is closer to 1/3
than to 1/4 or 1/2; so if you "snap to crystallographic values", it
should work even at very low precision.
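
How few digits are really needed can be checked with a small experiment: round every allowed translation to a given number of decimals and see whether snapping recovers it. The sketch below is illustrative only; the names 'nearest_translation' and 'survives_rounding' are hypothetical, and the candidate list is my assumption of which translations occur in space groups.

```python
from fractions import Fraction

# Candidate translations: 0, 1/6, 1/4, 1/3, 1/2, 2/3, 3/4, 5/6 and 1.
CANDIDATES = [Fraction(n, 12) for n in (0, 2, 3, 4, 6, 8, 9, 10, 12)]

def nearest_translation(value):
    """Return the candidate closest to `value`, folded into the range [0, 1)."""
    best = min(CANDIDATES, key=lambda c: abs(c - value))
    return best % 1

def survives_rounding(digits):
    """True if every candidate, rounded to `digits` decimals, snaps back to itself."""
    return all(
        nearest_translation(round(float(c), digits)) == c % 1
        for c in CANDIDATES
    )
```

In this experiment two decimals are already enough for every candidate, and '0.3' snaps to 1/3 as expected. A single decimal is the one case that breaks down, and only for the quarters: 0.25 and 0.75 cannot be written with one digit, so their rounded values land next to a sixth or a third instead.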
> Even worse is that other entries list the translation as +0.500
Why is this worse than '+0.5'? I would accept general computer language
floating point number notation here, defined by the extended regexp:
'[-+]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][-+]?[0-9]+)?'.
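
As a quick illustration, the pattern (with '*' as the ordinary repetition operator) can be tried out in Python; the helper name 'is_float_literal' is hypothetical and only serves this example.

```python
import re

# The floating point pattern from the text, compiled once.
FLOAT_PATTERN = re.compile(r"[-+]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][-+]?[0-9]+)?")

def is_float_literal(text):
    """True if the whole string is a floating point number in this notation."""
    return FLOAT_PATTERN.fullmatch(text) is not None

# Notations seen in symops, plus two strings that must be handled elsewhere.
for sample in ["0.5", "+0.500", "5.E-01", ".5", "1", "1/2", "x"]:
    if is_float_literal(sample):
        print(f"{sample!r} is a number with value {float(sample)}")
    else:
        print(f"{sample!r} is not a plain floating point number")
```

Every string the pattern accepts is also accepted by Python's 'float()', so '+0.500' and '5.E-01' both come out as 0.5; a vulgar fraction such as '1/2' is rejected here and needs its own branch in the symop parser.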
> and I have seen '...z+1' and z+7/6. I mean, why?
In the COD, we tend to leave the symmetry operators as they were encoded
by the authors of the structure, as long as we can parse them (and we
can easily parse the above mentioned constructs). The authors might have
a good reason to include them in such a way; and changing formatting
just for the sake of changing formatting might introduce extra errors
and would give us extra work for no real gain.
The 'z+1' is clearly the same as 'z', since all arithmetic is modulo 1.
Just take the fractional part of all translations you get...
> These are all non-standard translational operators that make it ...
To mark them as "non-standard" we first need to have a standard to check
against, and to my knowledge there is no explicit standard so far that
would specify the syntax of symmetry operators.
We are working on such a standard for the OPTIMADE API [6], but so far it is
a draft, and will not pertain to CIF, just to the OPTIMADE APIs...
> ridiculously difficult to map the operators to a space group.
I would respectfully disagree. Parsing the symmetry operators that are
currently in use is rather simple; we have implemented it with no real
difficulty. I agree that it is annoying when we have to guess what other
people have in mind, and I would rather have an explicit standard, but this
is the state of the art in data exchange so far, alas...
Also, we cannot set any specific format for symops in the COD – not
only may it introduce more errors, but it will make some people unhappy
as well, e.g. someone may complain that they need to deal with all these
vulgar fractions ('1/2') instead of just using a standard C/Ada/whatever
library to parse a "standard" floating point number (e.g. "5.E-01"). The
wishes will inevitably become contradictory.
Moreover, since, as you have noted, there are programs (Maud) that /do/
use decimal floating point translations, your parser will have to deal
with them anyway, even if we would change the convention in the COD. So
I see no way around the interpretation of all widespread symop
encodings, otherwise you will not be able to process some CIFs that are
in the wild out there (e.g. those from Maud...)
> Allowing all these exceptions make things very challenging. I would argue it improves the quality of the data if these type of things are standardized.
I agree that uniformity is helpful, but in the case of symops they are,
IMHO, uniform enough to enable automated processing of the whole corpus
of the COD data; there is no actual problem with it, either when using
existing libraries or when rolling your own.
> to what extent would you be willing to receive my findings or what are the possibilities for me to suggest edits?
It is very useful for us to get feedback from you, but we cannot act on
every proposal that we receive. We will fix obvious errors and "data
rot" ASAP (like 7708164.cif and 7054812.cif); but things like symop
encoding we will leave unchanged, since I very strongly insist that
processing of /all/ symop variants MUST (as in RFC 2119) be implemented
in every correct CIF library.
Sorry for a long e-mail... hope it will be somewhat helpful.
Sincerely,
Saulius
Refs.:
[1] Merkys, A.; Vaitkus, A.; Butkus, J.; Okulič-Kazarinas, M.; Kairys, V.
& Gražulis, S. /COD::CIF::Parser/: an error-correcting CIF parser for the
Perl language. /Journal of Applied Crystallography/, 2016, 49, 292-301,
DOI: https://doi.org/10.1107/S1600576715022396
[2] Merkys, A. et al. The 'cod-tools' package. URL:
https://github.com/cod-developers/cod-tools [accessed
2023-02-01T15:06+02:00]
[3] Article "Activation of carbon dioxide by new mixed sandwich uranium
complexes ...", DOI: https://doi.org/10.1039/c5nj00590f – Supplementary
files, Crystal structure data. URL:
https://www.rsc.org/suppdata/c5/nj/c5nj00590f/c5nj00590f2.cif [accessed
2023-02-01T14:40+02:00]
[4] IUCr. /International Tables for Crystallography/ (2006). Vol. B,
Chapter 1.4, pp. 99–161, "Symmetry in reciprocal space".
[5] IUCr. Symmetry dictionary (symCIF), v1.0.1 (2005). URL:
https://www.iucr.org/__data/iucr/cif/dictionaries/cif_sym.dic [accessed
2023-02-01T17:28+02:00]
[6] OPTIMADE issue #416: Insufficient space group descriptions. URL:
https://github.com/Materials-Consortia/OPTIMADE/issues/416 [accessed
2023-02-01T19:22+02:00].
--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366