[Cod-bugs] cif corrections

Tue Nov 22 10:47:14 EET 2022

Dear William,

thank you for the answer!

On 2022-11-19 01:54, William Lenthe wrote:
> Thanks for your detailed response, I had a few logic / parsing errors in my code that I was able to get cleaned up (not ignoring leading whitespace, handling more than 1 loop row per line, and incorrect handling of loops ending in a comment line).
Glad to hear you are working on the development of your parser!
> Upon closer inspection my remaining syntax issues are from fields taking the form:
>
> _cif_tag ;value\n
>
> I treat these as the start of a multiline delimited value, e.g. in 7223602:
>
> _computing_structure_solution    ;SHELXS-86'
> _diffrn_ambient_temperature      100(2)
> _diffrn_detector_area_resol_mean 28.5714
> _diffrn_measured_fraction_theta_full 0.982
> _diffrn_measured_fraction_theta_max 0.965
> _diffrn_measurement_device_type
> ;
> Rigaku Kappa 3 circle diffractometer with Saturn 724+ detector.
> ;
> _diffrn_measurement_method       'profile data from \w-scans'
>
> I treat all the_diffrn_[] lines as part of the string starting SHELXs-86'\n and then "Rigaku Kappa..." is seen as an incorrect key since it doesn't start with _.
I see. Well, this behaviour of the parser does not conform to the CIF 
syntax [1]. I would recommend against using it.
> My reading of the cif specification led me to believe that ; are only treated as delimiters if they are the first character of the line,
Indeed, the ';' tokens that delimit multi-line text fields MUST (as in 
RFC 2119) be in the first column of a line. So the specification-compliant 
interpretation of the above fragment is to treat the ;SHELXS-86' 
token as an unquoted string :/ Our COD parser does exactly that, and so 
do all other parsers that I have seen (PyCifRw, vcif, etc.). This 
results in correct parsing of the 7223602 COD entry.
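
The first-column rule can be illustrated with a minimal sketch. It is not 
the COD parser: the function name tokenize() and the token kinds WORD and 
TEXT are made up for illustration, and the sketch deliberately ignores 
quoted strings, comments and tag/value pairing. It only shows that a 
semicolon opens or closes a text field when it is the first character of 
a line, and is part of an unquoted string anywhere else.

```python
def tokenize(text):
    """Reduced CIF 1.1 tokenizer sketch (hypothetical helper).

    Handles only whitespace-separated unquoted tokens and
    semicolon-delimited text fields; no quotes, no comments.
    """
    tokens = []
    field_lines = None  # None means "not inside a text field"
    for line in text.splitlines():
        if field_lines is not None:
            if line.startswith(";"):
                # Semicolon in the first column closes the text field.
                tokens.append(("TEXT", "\n".join(field_lines)))
                field_lines = None
                tokens.extend(("WORD", w) for w in line[1:].split())
            else:
                field_lines.append(line)
        elif line.startswith(";"):
            # Semicolon in the first column opens a text field.
            field_lines = [line[1:]]
        else:
            # A semicolon anywhere else is ordinary token content.
            tokens.extend(("WORD", w) for w in line.split())
    if field_lines is not None:
        raise ValueError("end of file encountered while in text field")
    return tokens


FRAGMENT = """\
_computing_structure_solution    ;SHELXS-86'
_diffrn_ambient_temperature      100(2)
_diffrn_measurement_device_type
;
Rigaku Kappa 3 circle diffractometer with Saturn 724+ detector.
;
"""

for kind, value in tokenize(FRAGMENT):
    print(kind, repr(value))
```

With this reading, ;SHELXS-86' comes out as a single WORD token and the 
Rigaku line as the only TEXT token.
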
> ... but when I was strict, I had issues with cifs that contained fields like:
>
> _cif_tag ;
> Multi
> Line
> Field
> ;

This is an erroneous CIF, and a correct CIF parser MUST reject it. The 
first semicolon after the _cif_tag does NOT open a text field, so the 
second semicolon at the beginning of the line remains unpaired. A 
multi-line text field is only started and terminated by a semicolon on 
the very first position of a line [1]. This is what our parser reports:

> saulius at tasmanijos-velnias collection/ $ cat | cifparse
> data_x
> _cif_tag ;
> Multi
> Line
> Field
> ;
> cifparse: -(6) data_x: ERROR, end of file encountered while in text 
> field starting in line 6, possible runaway closing semicolon (';')
> cifparse: -(3,1) data_x: ERROR, incorrect CIF syntax:
>  Multi
>  ^
> cifparse: file '-' FAILED
The COD does not contain such files; all our CIFs pass the syntax checks. 
But in the wild there might be such broken CIFs, even as supplementary 
materials for reputable chemistry papers...

One can apply various "correction heuristics" in such cases; for example 
one could assume that a lone semicolon at the end of a line should 
actually be preceded by a newline. But this is a non-canonical extension 
of the CIF syntax.
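
Such a heuristic could be sketched roughly as follows. This is only an 
illustration under stated assumptions, not what our parser does: the name 
move_lone_semicolons() is hypothetical, the sketch ignores comments and 
quoted strings, and it only moves a lone semicolon that would *open* a 
text field; lines inside a text field are left untouched. It should be 
strictly opt-in, since it changes the meaning of syntactically valid 
input.

```python
import re

# A line whose last token is a lone semicolon that is NOT in column 1.
TRAILING_LONE_SEMICOLON = re.compile(r"^(.*\S)\s+;\s*$")


def move_lone_semicolons(text):
    """Move a trailing lone ';' to its own line (hypothetical heuristic)."""
    out = []
    in_field = False
    for line in text.splitlines():
        if in_field:
            # Inside a text field: copy verbatim, close on a leading ';'.
            out.append(line)
            if line.startswith(";"):
                in_field = False
            continue
        if line.startswith(";"):
            # A genuine text field opens here.
            in_field = True
            out.append(line)
            continue
        match = TRAILING_LONE_SEMICOLON.match(line)
        if match:
            # Assume the author meant the ';' to open a text field.
            out.append(match.group(1))
            out.append(";")
            in_field = True
        else:
            out.append(line)
    return "\n".join(out) + "\n"


BROKEN = "data_x\n_cif_tag ;\nMulti\nLine\nField\n;\n"
print(move_lone_semicolons(BROKEN), end="")
```
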

I must note that one variant of this mistake /does/ parse without errors:

> data_x
> loop_
> _cif_tag ;
> Multi
> Line
> Field ;
>
Note that in this case /neither/ semicolon is in the first column, so 
both are interpreted as unquoted strings; and there is a loop_ before the 
CIF tag, therefore all five unquoted strings (;, Multi, Line, Field, ;) 
end up as values of the '_cif_tag' data item. I see no way of correcting 
this automatically, except maybe by applying some optional heuristic that 
moves lone semicolons to new lines.

The same situation was detected by your software in the entry 4301644 
and I fixed it manually in the entries 4301644 and 4301643 (both from 
the same paper). The original files were syntactically correct but did 
not convey the intended information.

> So I loosened my parser to allow it.
I would recommend against doing so, because you now reject syntactically 
correct CIFs and risk losing data. I would only use such an interpretation 
as part of a deliberate, optional error correction and recovery step (our 
parser corrects some of the common errors from supplementary materials, 
but not this one, unfortunately...).
> I also have seen cifs that use:
>
> _cif_tag ;value that should probably be delimited with quotes;
This is a tag followed by several unquoted strings; it is an error 
outside a loop_, and valid inside a loop_ only if the number of data 
values is divisible by the number of data names following the loop_.
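
The divisibility condition can be written down as a small check. This is 
a sketch only; loop_rows() is a hypothetical helper, not part of any 
existing CIF library, and it assumes the tags and values have already 
been tokenized.

```python
def loop_rows(tags, values):
    """Group loop_ values into rows, one value per tag (hypothetical helper)."""
    if not tags:
        raise ValueError("a loop_ needs at least one data name")
    if len(values) % len(tags) != 0:
        raise ValueError(
            "%d values cannot be divided among %d data names"
            % (len(values), len(tags))
        )
    width = len(tags)
    return [
        dict(zip(tags, values[start:start + width]))
        for start in range(0, len(values), width)
    ]


# The line from the question, read as unquoted strings inside a loop_:
values = ";value that should probably be delimited with quotes;".split()
rows = loop_rows(["_cif_tag"], values)
print(len(rows), rows[0], rows[-1])
```

With a single data name any number of values is accepted, which is why 
such a line passes the syntax check inside a loop_ while carrying 
unintended data.
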
> Unfortunately, there isn't an unambiguous way to support all 3 cases. Do you understand any/all of these to be allowable?
IMHO the variants like "_cif_tag ;value that should probably be 
delimited with quotes;" or "_cif_tag ;" are errors and should be 
rejected, or parsed in accordance with the current CIF grammar. It is 
probable that CIF authors sometimes just guess what the CIF should 
look like without consulting the formal grammar, and come up with texts 
that are not correct (I was guilty of this as well a long time ago 
;). The only way to deal with such CIFs, IMHO, is to find out the 
authors' actual intentions and to fix the file syntax in accordance 
with the grammar, manually or semi-automatically.
>   The following cifs may have some technically correct but unintended values that were generating obtuse errors as a result:
>
> 7223602: _computing_structure_solution    ;SHELXS-86'
Indeed, this is technically correct but with a strange (most probably 
unintended) value of the software name. Can be fixed manually.
> 7228312: _diffrn_measurement_device_type ;Nonius
Again this is correct but probably unintended. Can be fixed manually.
> 7238658: _exptl_absorpt_correction_type   ;multi-scan'

This is syntactically correct but fails validation against the IUCr 
dictionaries:

> /usr/bin/cif_validate: 
> /home/saulius/struct/cod/cif/7/23/86/7238658.cif data_7238658: NOTE, 
> data item '_diffrn_detector_area_resol_mean' value '0.15 mm' violates 
> type constraints -- the value should be a numerically interpretable 
> string, e.g. '42', '42.00', '4200E-2'.
> /usr/bin/cif_validate: 
> /home/saulius/struct/cod/cif/7/23/86/7238658.cif data_7238658: NOTE, 
> data item '_exptl_absorpt_correction_type' value ';multi-scan'' must 
> be one of the enumeration values [analytical, cylinder, empirical, 
> gaussian, integration, multi-scan, none, numerical, psi-scan, refdelf, 
> sphere].

Can be fixed manually or semi-automatically (we can add a regexp to our 
data checker if this bug is encountered often enough; but it is probably 
a one-of-a-kind error...).
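
Such a regexp might look roughly like the sketch below. It is an 
assumption for illustration, not code from our data checker: the names 
SUSPICIOUS_VALUE and suggest_fix() are made up, and the pattern only 
covers the three values reported above (a leading semicolon, optionally 
followed by a stray closing quote).

```python
import re

# An unquoted value that starts with ';' and may end with a stray quote.
SUSPICIOUS_VALUE = re.compile(r"""^;(?P<core>.*?)['"]?$""")


def suggest_fix(value):
    """Return the probably intended value, or None if nothing looks wrong."""
    match = SUSPICIOUS_VALUE.match(value)
    if match and match.group("core"):
        return match.group("core")
    return None


for value in (";SHELXS-86'", ";Nonius", ";multi-scan'", "multi-scan"):
    print(repr(value), "->", repr(suggest_fix(value)))
```

The suggestion would still need a human check before being applied, since 
the pattern cannot know what the authors intended.
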

Regards,
Saulius

Refs.:

[1] IUCr. CIF v1.1 File Syntax. URL: 
https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#gram 
[accessed 2022-11-18T18:23+02:00].

-- 
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366
