[Cod-bugs] Corrupted files in COD

Wed Feb 1 19:59:12 EET 2023

Dear Steef,

many thanks for your report on the issues with the COD data! Your 
feedback is very valuable for us. I have fixed some of the problems (the 
file 7/70/81/7708164.cif should now be OK); regarding the others, I give 
my answers below.

On 2023-02-01 00:39, Steef Boerrigter wrote:
> I am currently developing a program in the programming language of D
> to read .cif files and process the contents to calculate various
> things. I am sure I am just one of hundreds to have taken the
> frustrating decision to try and write a comprehensive parser of "STAR"
> formatted files.
As a side note: if writing a CIF parser de novo feels frustrating, you 
may want to have a look at our CIF parser from cod-tools [1,2] – maybe 
it will be easier to link it with your program than to write a 
completely new one. Although the paper focuses on the Perl 
implementation, there is a core parser ('cifparse') which is in plain C, 
with Perl and Python bindings. It is rather portable – one of my 
students recently linked it with a multi-tasking Ada program :); it 
should not be that difficult to link it with D either. The parser also 
has the capability to correct some common mistakes in CIF syntax, such 
as missing closing quotes.
> During testing of my implementation, I came across two files that
> clearly are corrupted. I deleted them on my mirror, re-synced and
> received the exact same corrupted files.
Which protocol did you use for synchronisation? In the latter case, it 
would probably have helped to check out the file from the Subversion 
repository (svn://crystallography.net/cod). Sure enough, SVN is also not 
infallible, but it is a distribution route different from 'rsync' and 
'http(s)', so it may be useful to have such a backup. You can also 
'svnsync' the whole repo to have a local read-only copy.
>   So, I am pretty sure the bitrot is on the COD server.
>
> The files are
> 7/70/81/7708164.cif which has zero bytes.

This file was indeed damaged; many thanks for spotting it!

I have restored the file from the repository, and now both 'rsync' and 
'http(s)' protocols should yield correct data. Please have a look. The 
repository seems intact. I'm now comparing checksums for the remaining 
files, to see if there are more corrupt ones on the server. The 'bit 
rot' probably happened when we had an HDD failure some time ago.
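
A checksum comparison of this kind can also be run against your own 
mirror. Below is a minimal Python sketch, assuming you have a trusted 
reference copy (e.g. an SVN checkout) next to the mirrored tree; the 
function names and the directory layout are illustrative assumptions, 
not the scripts we run on the server:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(reference_root, mirror_root, pattern="*.cif"):
    """Yield relative paths whose mirror copy is missing or differs.

    Both arguments are hypothetical local directories: a trusted
    reference tree and the mirror that is being verified.
    """
    reference_root = Path(reference_root)
    mirror_root = Path(mirror_root)
    for ref_file in sorted(reference_root.rglob(pattern)):
        relative = ref_file.relative_to(reference_root)
        candidate = mirror_root / relative
        if not candidate.is_file() or sha256_of(candidate) != sha256_of(ref_file):
            yield relative
```

A zero-byte file such as the damaged 7708164.cif would be reported by 
'differing_files', since its digest differs from the reference one.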

> 7/05/48/7054812.cif which goes into corruption at line 55186.

This file is a different story. The file itself is in fact syntactically 
correct, is served exactly as stored in the repository, and most of the 
data are intact. However, you are absolutely right, the reflection list 
in the file is garbled towards its end. Since the list itself is in a 
text field, a (correct) CIF parser accepts the file. But the reflection 
list can not be used as it is.

The problem comes from the original supplementary data of the article 
[3]; there, the same corruption appears at line 66863. COD just 
reproduces this situation.

I have written an e-mail to the authors of the original publication. If 
they still have the original file and are ready to share it with us, we 
will update the corresponding COD entry with the correct HKL Fobs list. 
If they do not answer or do not have the file, I think we will probably 
have to curate the data by truncating the reflection list at the 
reflection "15  -3   5 -7.40    8.00 166 0.27655 ...", and posting the 
corresponding warning in the CIF. The truncated reflection list, even 
though incomplete, should still be usable (e.g. one can still compute R 
factors, re-refine the structure, etc.)
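
For illustration only, such a truncation could be sketched in Python as 
below. The record layout (three integer Miller indices followed by one 
or more numeric columns) and the function name are my assumptions for 
this example; the actual curation may well be done differently:

```python
import re

# A general floating point number, e.g. '-7.40', '166' or '5.E-01'.
NUMBER = r"[-+]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+)?"

# Assumed layout of one reflection record: h, k, l as integers,
# then at least one further numeric column.
REFLECTION = re.compile(
    r"^\s*(?:[-+]?[0-9]+\s+){3}" + NUMBER + r"(?:\s+" + NUMBER + r")*\s*$"
)


def valid_prefix(lines):
    """Return the leading run of lines that look like reflection records."""
    kept = []
    for line in lines:
        if not REFLECTION.match(line):
            break  # first garbled line: drop it and everything after it
        kept.append(line)
    return kept
```

Everything from the first line that does not look like a reflection 
record is dropped, so the result is a clean, if incomplete, list.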

Please watch the updates (new revisions) of this file.

> During testing, I further came across several hundred files that have rather questionable formatting choices that I would argue are either in violation with the CIF specification
Well, most probably they are not in violation :). We went rather 
carefully through the syntax definitions of CIF and the Tables, and the 
discrepancies were analysed and fixed. The remaining syntax (unless we 
overlooked something very nasty :) ) should satisfy the specification of 
the CIF.
>   or stretch the rules to the extent that it makes it almost impossible for any implementation to interpret the data correctly.

I would say there are a lot of implementations, including our own, that 
parse most of the data correctly, including all symmetry operators (this 
is what we use in our calculations).

> To what extent are the maintainers interested in learning about my findings and potentially amending the entries to fix them?
We are for sure interested to hear your ideas, and will fix things 
wherever possible. We can, however, only take suggestions that have 
an absolutely firm mandate in the CIF standard.
> Just to name one example. Apparently the program Maud produces the
> spacegroup operators in the format (see 3/50/01/3500127.cif)
> 1 '-x+0.25, -y+0.25, -z+0.25'
> as opposed to
> 1 '-x+1/4, -y+1/4, -z+1/4'
> To my knowledge, none of the IUCR CIF guidelines, specs, website,
> international tables ever use the decimal format for the translations.

Regarding decimal fractions: I have just additionally looked through my 
copy of the Tables and the CIF dictionaries. That's true, they never use 
decimal points in an example. But I also did not find any place where 
they /forbid/ the use of real numbers in the way Maud does. What is not 
explicitly forbidden is allowed.

The ITC vols. A and B talk about "real numbers" everywhere where 
symmetry operator or matrix notation is involved [4], e.g.:

> The change-of-basis operator V has the general form ($v_x$, $v_y$, $v_z$).
> The vectors $v_x$, $v_y$ and $v_z$ are specified by
>
> where $r_{i,j}$ and $t_{i}$ are /fractions/ or /real numbers/ 
> (emphasis mine).

As we see, the numbers are supposed to be /real numbers/, and they are 
explicitly mentioned as distinct from /fractions/. Thus, although all 
examples in the ITC indeed use vulgar fractions for crystallographic 
translations, decimal fractions (a.k.a. /real numbers/, or 
approximations thereof) seem to be assumed as permissible.

At this point I get the impression that neither CIF nor the Tables are 
concerned with standardisation of computer-readable representations; 
they just give mathematical definitions (/real numbers/) and give 
examples of the notions in the text.

Further, the CIF data item definitions say [5]:

>      _item.name                  '_space_group_symop.operation_xyz'
>      # ...
>      _item_examples.detail
>                'x,1/2-y,1/2+z'  'c glide reflection through the plane (x,1/4,z)'
>      _item_description.description
> ;               A *parsable string giving one of the symmetry operations* of the
>                  space group in algebraic form.

No grammar for '_space_group_symop.operation_xyz' or related fields is 
given.

I interpret these texts in the following way: all unambiguously 
/parsable/ symop descriptions should be accepted, /provided they make 
crystallographic sense./ The interpreter should accept as broad a range 
of syntaxes as possible; of course we should write as narrow a range as 
possible, but the latter is valid for one single program and can not be 
applied to a collective database like COD.

The operator '-x+0.25, -y+0.25, -z+0.25' is clearly parsable, clearly 
unambiguous, and clearly crystallographically correct. I therefore see 
no reason (formal or otherwise) to reject it.

Thus, in the COD, we do not convert decimal fractions in the symmetry 
operators ('0.50') to vulgar fractions (1/2) if decimals were present in 
the original file. It is expected that clients can parse both notations 
(we did the conversion for coordinates, though; some people specified 
atom coordinate 'y' as '1/4' or even as 'x' – guess what /that/ means... ;)

My suggestion (and our currently implemented symop parser behaviour) is 
to treat symops in the following way:

1. accept all possible translation notations: 'x+7/6', '1/6+x', 'x-5/6' 
(it is the same as 'x+1/6', and not clear why one should be preferred 
over another!), 'x+0.166667';

2. reconstruct all Seitz matrices from these notations;

3. reduce all translations "modulo 1" (i.e. '7/6' → '1.16667' → '0.16667');

4. snap all crystallographic translations to the nearest 
crystallographic value of your choice (i.e. '0.16667' → 1/6);

5. use rational arithmetic if your platform supports it;

6. check whether your symops are crystallographic and whether they form 
a group (all symops that are necessary to reconstruct the unit cell MUST 
be specified, as per CIF dictionaries).

This works, in my hands, for 100% of the COD symops and 99% of the 
symops out there in the wild.
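
To make the recipe more tangible, here is a minimal Python sketch of 
steps 1 to 5 (the group check of step 6 is left out). It is not the 
cod-tools code; the function names, the set of allowed translations and 
the snapping tolerance are all assumptions made for this example:

```python
import re
from fractions import Fraction

# Assumed set of crystallographic translations to snap to.
ALLOWED = sorted({Fraction(n, d) for d in (1, 2, 3, 4, 6, 8) for n in range(d)})

TERM = re.compile(
    r"([-+]?)"                 # optional sign
    r"(?:([xyz])"              # an axis letter, or
    r"|([0-9]+)/([0-9]+)"      # a vulgar fraction, or
    r"|((?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:[eE][-+]?[0-9]+)?))"  # a decimal
)


def parse_component(text):
    """Parse e.g. '-x+0.25' into ([-1, 0, 0], Fraction(1, 4))."""
    text = text.replace(" ", "").lower()
    row, shift, pos = [0, 0, 0], Fraction(0), 0
    while pos < len(text):
        match = TERM.match(text, pos)
        # Every term after the first one must carry an explicit sign.
        if match is None or (pos > 0 and not match.group(1)):
            raise ValueError(f"cannot parse symop component {text!r}")
        sign = -1 if match.group(1) == "-" else 1
        axis, num, den, dec = match.group(2, 3, 4, 5)
        if axis:
            row["xyz".index(axis)] += sign
        elif num:
            shift += sign * Fraction(int(num), int(den))
        else:
            shift += sign * Fraction(dec)
        pos = match.end()
    return row, shift


def snap(value, tolerance=Fraction(1, 50)):
    """Reduce modulo 1, then snap to the nearest allowed translation."""
    value %= 1

    def distance(candidate):
        d = abs(value - candidate)
        return min(d, 1 - d)  # circular distance, so 0.99999 snaps to 0

    best = min(ALLOWED, key=distance)
    if distance(best) > tolerance:
        raise ValueError(f"{float(value)} is not a crystallographic translation")
    return best


def seitz_matrix(symop):
    """Turn 'x+7/6, -y, z+0.5' into a 4x4 Seitz matrix of Fractions."""
    components = symop.split(",")
    if len(components) != 3:
        raise ValueError(f"expected three components in {symop!r}")
    matrix = []
    for component in components:
        row, shift = parse_component(component)
        matrix.append([Fraction(r) for r in row] + [snap(shift)])
    matrix.append([Fraction(0)] * 3 + [Fraction(1)])
    return matrix


for text in ("x+7/6, y, z", "1/6+x, y, z", "x+0.166667, y, z"):
    print(text, "->", seitz_matrix(text)[0][3])  # each prints 1/6
```

Keeping the translations as 'Fraction' objects means that later matrix 
products stay exact, which makes a group closure test straightforward.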

> It is bad enough to have to program an exception to the standard fractional notation, but what happens with the 1/3 translation.
Snap to the nearest crystallographic translation: 0.33333 → Rational (1,3);
> How many decimals should that get in this format.
Standard IEEE 754 single precision float (at least 6 decimal digits) is 
more than enough. In fact, even one digit, '0.3', is closer to 1/3 than 
to 1/4 or 1/2; so if you "snap to crystallographic values", it should 
work with very low precision as well.
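
A snapping routine of this kind can be sketched as follows; the list of 
candidate translations and the function name are assumptions of this 
example, so adjust them to the values you consider crystallographic:

```python
from fractions import Fraction

# Candidate translations; the choice of denominators is an assumption.
CANDIDATES = sorted({Fraction(n, d) for d in (1, 2, 3, 4, 6, 8) for n in range(d)})


def nearest_translation(text):
    """Return the candidate closest to the decimal string, modulo 1."""
    value = Fraction(text) % 1

    def circular_distance(candidate):
        d = abs(value - candidate)
        return min(d, 1 - d)  # 0.99999 is close to 0, not to 7/8

    return min(CANDIDATES, key=circular_distance)


print(nearest_translation("0.33333"))  # 1/3
print(nearest_translation("0.3"))      # 1/3
print(nearest_translation("0.500"))    # 1/2
```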
> Even worse is that other entries list the translation as +0.500

Why is this worse than '+0.5'? I would accept the general computer 
language floating point number notation here, defined by the extended 
regexp: '[-+]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][-+]?[0-9]+)?'.
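
A pattern of this kind is easy to try out. Here is a short sketch with 
Python's 're' module (the helper name 'is_float_literal' is just an 
illustration):

```python
import re

# Sign, then digits with an optional fractional part (or a bare
# fractional part), then an optional exponent.
FLOAT = re.compile(r"[-+]?([0-9]+(\.[0-9]*)?|\.[0-9]+)([eE][-+]?[0-9]+)?")


def is_float_literal(text):
    """True if the whole string is a floating point literal."""
    return FLOAT.fullmatch(text) is not None


for sample in ("+0.500", "0.5", ".5", "5.E-01", "1/2", "."):
    print(repr(sample), is_float_literal(sample))
```

The first four samples are accepted; '1/2' and a lone '.' are not, so 
vulgar fractions still need their own branch in the parser.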

> and I have seen '...z+1' and z+7/6. I mean, why?

In the COD, we tend to leave the symmetry operators as they were encoded 
by the authors of the structure, as long as we can parse them (and we 
can easily parse the above mentioned constructs). The authors might have 
had a good reason to write them in such a way; and changing formatting 
just for the sake of changing formatting might introduce extra errors 
and give us extra work for no real gain.

The 'z+1' is clearly the same as 'z', since all arithmetic is modulo 1. 
Just take the fractional part of all translations you get...
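
With rational arithmetic this is a one-liner. A small Python sketch 
(the function name is only for illustration):

```python
from fractions import Fraction


def fractional_part(shift):
    """Reduce a translation modulo 1 into the interval [0, 1)."""
    return Fraction(shift) % 1


for shift in ("1", "7/6", "-5/6", "0.500"):
    print(shift, "->", fractional_part(shift))
```

This prints 0, 1/6, 1/6 and 1/2; note that Python's '%' already returns 
a non-negative result for negative translations.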

> These are all non-standard translational operators that make it ...

To mark them as "non-standard" we first need to have a standard to check 
against, and to my knowledge there is no explicit standard so far that 
would specify the syntax of symmetry operators.

We are working on such a standard for the OPTIMADE API [6], but so far 
it is a draft, and it will not pertain to CIF, just to the OPTIMADE 
APIs...

> ridiculously difficult to map the operators to a space group.

I would respectfully disagree. Parsing the symmetry operators that are 
currently in use is rather simple; we have implemented it with no real 
difficulty. I agree that it is annoying when we have to guess what other 
people have in mind, and I would rather have an explicit standard, but 
this is the state of the art in data exchange so far, alas...

Also, we can not set any specific format for symops in the COD – not 
only may it introduce more errors, but it will make some people unhappy 
as well, e.g. someone may complain that they need to deal with all these 
vulgar fractions ('1/2') instead of just using a standard C/Ada/whatever 
library to parse a "standard" floating point number (e.g. "5.E-01"). The 
wishes will inevitably become contradictory.

Moreover, since, as you have noted, there are programs (Maud) that /do/ 
use decimal floating point translations, your parser will have to deal 
with them anyway, even if we changed the convention in the COD. So I see 
no way around interpreting all widespread symop encodings; otherwise you 
will not be able to process some CIFs that are in the wild out there 
(e.g. those from Maud...)

> Allowing all these exceptions make things very challenging. I would argue it improves the quality of the data if these type of things are standardized.
I agree that uniformity is helpful, but in the case of symops they are, 
IMHO, uniform enough to enable automated processing of the whole corpus 
of the COD data. There is no actual problem with it, either when using 
existing libraries or when rolling out your own.
> to what extent would you be willing to receive my findings or what are the possibilities for me to suggest edits?

It is very useful for us to get feedback from you, but we can not act on 
every proposal that we receive. We will fix obvious errors and "data 
rot" ASAP (like 7708164.cif and 7054812.cif); but things like symop 
encoding we will leave unchanged, since I very strongly insist that 
processing of /all/ symop variants MUST (as in RFC 2119) be implemented 
in every correct CIF library.

Sorry for a long e-mail... hope it will be somewhat helpful.

Sincerely,
Saulius

Refs.:

[1] Merkys, A.; Vaitkus, A.; Butkus, J.; Okulič-Kazarinas, M.; 
Kairys, V. & Gražulis, S. /COD::CIF::Parser/: an error-correcting CIF 
parser for the Perl language. /Journal of Applied Crystallography/, 
2016, 49, 292-301, DOI: https://doi.org/10.1107/S1600576715022396

[2] Merkys, A. et al. The 'cod-tools' package. URL: 
https://github.com/cod-developers/cod-tools [accessed 
2023-02-01T15:06+02:00]

[3] Article "Activation of carbon dioxide by new mixed sandwich uranium 
complexes ...", DOI: https://doi.org/10.1039/c5nj00590f – Supplementary 
files, Crystal structure data. URL: 
https://www.rsc.org/suppdata/c5/nj/c5nj00590f/c5nj00590f2.cif [accessed 
2023-02-01T14:40+02:00]

[4] IUCr. /International Tables for Crystallography/ (2006). Vol. B, 
Chapter 1.4, pp. 99–161, "Symmetry in reciprocal space".

[5] IUCr. Symmetry dictionary (symCIF), v1.0.1 (2005). URL: 
https://www.iucr.org/__data/iucr/cif/dictionaries/cif_sym.dic [accessed 
2023-02-01T17:28+02:00]

[6] OPTIMADE issue #416: Insufficient space group descriptions. URL: 
https://github.com/Materials-Consortia/OPTIMADE/issues/416 [accessed 
2023-02-01T19:22+02:00].

-- 
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366
