<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Dear William,</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">thank you for the answer!<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 2022-11-19 01:54, William Lenthe
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:SN1PR07MB3902F891C4103CD0C038422888099@SN1PR07MB3902.namprd07.prod.outlook.com">
<pre class="moz-quote-pre" wrap="">Thanks for your detailed response, I had a few logic / parsing errors in my code that I was able to get cleaned up (not ignoring leading whitespace, handling more than 1 loop row per line, and incorrect handling of loops ending in a comment line). </pre>
</blockquote>
Glad to hear you are working on the development of your parser!<br>
<blockquote type="cite"
cite="mid:SN1PR07MB3902F891C4103CD0C038422888099@SN1PR07MB3902.namprd07.prod.outlook.com">
<pre class="moz-quote-pre" wrap="">Upon closer inspection my remaining syntax issues are from fields taking the form:
_cif_tag ;value\n
I treat these as the start of a multiline delimited value, e.g. in 7223602:
_computing_structure_solution ;SHELXS-86'
_diffrn_ambient_temperature 100(2)
_diffrn_detector_area_resol_mean 28.5714
_diffrn_measured_fraction_theta_full 0.982
_diffrn_measured_fraction_theta_max 0.965
_diffrn_measurement_device_type
;
Rigaku Kappa 3 circle diffractometer with Saturn 724+ detector.
;
_diffrn_measurement_method 'profile data from \w-scans'
I treat all the <span class="moz-txt-underscore"><span class="moz-txt-tag">_</span>diffrn<span class="moz-txt-tag">_</span></span>[] lines as part of the string starting SHELXs-86'\n and then "Rigaku Kappa..." is seen as an incorrect key since it doesn't start with _. </pre>
</blockquote>
I see. This parser behaviour does not conform to the
CIF syntax [1], and I would recommend against it.<br>
<blockquote type="cite"
cite="mid:SN1PR07MB3902F891C4103CD0C038422888099@SN1PR07MB3902.namprd07.prod.outlook.com">
<pre class="moz-quote-pre" wrap="">My reading of the cif specification led me to believe that ; are only treated as delimiters if they are the first character of the line, </pre>
</blockquote>
Indeed, the ';' tokens that delimit multi-line text fields MUST (as
in RFC 2119) be in the first column of a line. So the
specification-compliant interpretation of the above fragment is to
treat the <font
face="monospace">;SHELXS-86'</font> token as an unquoted string.
Our COD parser does exactly that, and so do all other parsers
that I have seen (PyCifRw, vcif, etc.). This results in correct
parsing of the 7223602 COD entry.<br>
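<p>A minimal Python sketch of this column-1 rule follows. It is only an
illustration under stated assumptions, not the code of the COD parser:
the function name <font face="monospace">tokenize</font> and the token
kinds are hypothetical, and quoted strings and the data_/loop_ keywords
are deliberately not handled.</p>
<pre>
```python
def tokenize(text):
    """Split CIF text into (kind, value) pairs; a simplified sketch."""
    tokens = []
    field = None  # lines of the text field being read, or None
    for line in text.splitlines():
        if field is not None:
            if line.startswith(";"):
                # A semicolon in column 1 closes the text field.
                tokens.append(("text_field", "\n".join(field)))
                field = None
                line = line[1:]  # the rest of the line is tokenized below
            else:
                field.append(line)
                continue
        elif line.startswith(";"):
            # A semicolon in column 1 opens a text field.
            field = [line[1:]]
            continue
        for word in line.split():
            if word.startswith("#"):
                break  # the rest of the line is a comment
            if word.startswith("_"):
                tokens.append(("tag", word))
            else:
                # Covers words such as ;SHELXS-86' whose ';' is not in column 1.
                tokens.append(("unquoted", word))
    if field is not None:
        raise ValueError("end of file inside a text field")
    return tokens
```
</pre>
<p>With such a tokenizer, <font face="monospace">;SHELXS-86'</font>
comes out as a single unquoted value, while the fragment
"<font face="monospace">_cif_tag ;</font>" followed by a column-1
semicolon at the end raises an error, because that last semicolon opens
a text field that is never closed.</p>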
<blockquote type="cite"
cite="mid:SN1PR07MB3902F891C4103CD0C038422888099@SN1PR07MB3902.namprd07.prod.outlook.com">
<pre class="moz-quote-pre" wrap="">... but when I was strict, I had issues with cifs that contained fields like:
_cif_tag ;
Multi
Line
Field
;</pre>
</blockquote>
<p>This is an erroneous CIF, and a correct CIF parser MUST reject
it. The first semicolon after the _cif_tag does NOT open a text
field, so the second semicolon at the beginning of the line
remains unpaired. A multi-line text field is only started and
terminated by a semicolon on the very first position of a line
[1]. This is what our parser reports:</p>
<p>
<blockquote type="cite"><font face="monospace">saulius@tasmanijos-velnias
collection/ $ cat | cifparse <br>
data_x<br>
_cif_tag ;<br>
Multi<br>
Line<br>
Field<br>
;<br>
cifparse: -(6) data_x: ERROR, end of file encountered while in
text field starting in line 6, possible runaway closing
semicolon (';')<br>
cifparse: -(3,1) data_x: ERROR, incorrect CIF syntax:<br>
Multi<br>
^<br>
cifparse: file '-' FAILED</font><br>
</blockquote>
The COD does not contain such files; all our CIFs pass the syntax
checks. But in the wild there might be such broken CIFs, even as
supplementary materials for reputable chemistry papers...<br>
</p>
<p>One can apply various "correction heuristics" in such cases; for
example, one could assume that a lone semicolon at the end of a
line should actually be preceded by a newline. But this is a
non-canonical extension of the CIF syntax.</p>
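<p>Such a heuristic could be sketched in Python roughly as follows. The
function name <font face="monospace">move_trailing_semicolons</font> is
hypothetical, and the sketch rewrites every matching line, including
lines inside a valid text field, so it should only ever run as an
explicit, optional repair step on files that have already failed to
parse.</p>
<pre>
```python
def move_trailing_semicolons(text):
    """Put a lone ';' that ends a line onto a line of its own."""
    repaired = []
    for line in text.splitlines():
        head = line.rstrip()
        # Only touch lines that end in a whitespace-separated ';' token
        # and have something else before it.
        if head.endswith(";") and head[:-1].strip() and head[-2].isspace():
            repaired.append(head[:-1].rstrip())
            repaired.append(";")
        else:
            repaired.append(line)
    return "\n".join(repaired) + "\n"
```
</pre>
<p>A line consisting of a single semicolon, or a value such as
<font face="monospace">abc;</font> with no whitespace before the
semicolon, is left untouched.</p>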
<p>I must note that one variant of this mistake <i>does</i> parse
without syntax errors:</p>
<blockquote type="cite">
<p><font face="monospace">data_x<br>
loop_<br>
_cif_tag ;<br>
Multi<br>
Line<br>
Field ;</font></p>
</blockquote>
<p>Note that in this case <i>neither</i> semicolon is in the
first column, so both are interpreted as unquoted strings; and there is
a loop_ before the CIF tag, therefore all five unquoted strings
(;, Multi, Line, Field, ;) end up as values of the '_cif_tag'
data item. I see no way of correcting this automatically, except
perhaps by applying an optional heuristic that moves lone semicolons
to new lines.</p>
<p>The same situation was detected by your software in the entry
4301644 and I fixed it manually in the entries 4301644 and 4301643
(both from the same paper). The original files were syntactically
correct but did not convey the intended information.
</p>
<blockquote type="cite"
cite="mid:SN1PR07MB3902F891C4103CD0C038422888099@SN1PR07MB3902.namprd07.prod.outlook.com">
<pre class="moz-quote-pre" wrap="">So I loosened my parser to allow it. </pre>
</blockquote>
I would recommend against doing so, because you now reject
syntactically correct CIFs and risk losing data. I would only use
such an interpretation as part of a deliberate, optional error
correction and recovery (our parser corrects some of the common
errors from supplementary materials, but not this one,
unfortunately...).<br>
<blockquote type="cite"
cite="mid:SN1PR07MB3902F891C4103CD0C038422888099@SN1PR07MB3902.namprd07.prod.outlook.com">
<pre class="moz-quote-pre" wrap="">I also have seen cifs that use:
_cif_tag ;value that should probably be delimited with quotes;</pre>
</blockquote>
This is a tag followed by a bunch of unquoted strings. Outside a loop_
this is an error, since a tag must be followed by exactly one value;
inside a loop_ it is valid as long as the number
of data values is divisible by the number of data names following
the loop_.
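<p>The counting rule can be written down as a small Python check. This
is only a sketch; the function name
<font face="monospace">values_fit</font> and its parameters are
hypothetical, and the values are counted by naive whitespace
splitting.</p>
<pre>
```python
def values_fit(n_names, n_values, in_loop):
    """Return True if the value count is allowed for the given tag count."""
    if not in_loop:
        # Outside a loop_, each tag takes exactly one value.
        return n_names == 1 and n_values == 1
    # Inside a loop_, the values must fill whole rows.
    return n_names != 0 and n_values != 0 and n_values % n_names == 0


words = "_cif_tag ;value that should probably be delimited with quotes;".split()
n_values = len(words) - 1  # every word after the tag is an unquoted string
print(values_fit(1, n_values, in_loop=False))
print(values_fit(1, n_values, in_loop=True))
```
</pre>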
<blockquote type="cite"
cite="mid:SN1PR07MB3902F891C4103CD0C038422888099@SN1PR07MB3902.namprd07.prod.outlook.com">
<pre class="moz-quote-pre" wrap="">Unfortunately, there isn't an unambiguous way to support all 3 cases. Do you understand any/all of these to be allowable?</pre>
</blockquote>
IMHO the variants like "<font face="monospace">_cif_tag ;value that
should probably be delimited with quotes;</font>" or "<font
face="monospace">_cif_tag ;</font>" are errors and should be
rejected, or parsed in accordance with the current CIF grammar. It
is probable that CIF authors sometimes just guess what a CIF
should look like without consulting the formal grammar, and come up
with texts that are not correct (I was guilty of this as well a
long time ago ;). The only way to deal with such CIFs, IMHO, is to
find out the authors' actual intentions and to fix the file syntax
in accordance with the grammar, manually or semi-automatically.<br>
<blockquote type="cite"
cite="mid:SN1PR07MB3902F891C4103CD0C038422888099@SN1PR07MB3902.namprd07.prod.outlook.com">
<pre class="moz-quote-pre" wrap=""> The following cifs may have some technically correct but unintended values that were generating obtuse errors as a result:
7223602: _computing_structure_solution ;SHELXS-86'</pre>
</blockquote>
Indeed, this is technically correct but with a strange (most
probably unintended) value of the software name. Can be fixed
manually.
<blockquote type="cite"
cite="mid:SN1PR07MB3902F891C4103CD0C038422888099@SN1PR07MB3902.namprd07.prod.outlook.com">
<pre class="moz-quote-pre" wrap="">7228312: _diffrn_measurement_device_type ;Nonius</pre>
</blockquote>
Again this is correct but probably unintended. Can be fixed
manually.
<blockquote type="cite"
cite="mid:SN1PR07MB3902F891C4103CD0C038422888099@SN1PR07MB3902.namprd07.prod.outlook.com">
<pre class="moz-quote-pre" wrap="">7238658: _exptl_absorpt_correction_type ;multi-scan'</pre>
</blockquote>
<p>This is syntactically correct but fails validation against the
IUCr dictionaries:</p>
<p>
<blockquote type="cite"><font face="monospace">/usr/bin/cif_validate:
/home/saulius/struct/cod/cif/7/23/86/7238658.cif data_7238658:
NOTE, data item '_diffrn_detector_area_resol_mean' value '0.15
mm' violates type constraints -- the value should be a
numerically interpretable string, e.g. '42', '42.00',
'4200E-2'.<br>
</font></blockquote>
<blockquote type="cite"><font face="monospace">/usr/bin/cif_validate:
/home/saulius/struct/cod/cif/7/23/86/7238658.cif data_7238658:
NOTE, data item '_exptl_absorpt_correction_type' value '<font
color="#ef2929"><b>;multi-scan'</b></font>' must be one of
the enumeration values [analytical, cylinder, empirical,
gaussian, integration, multi-scan, none, numerical, psi-scan,
refdelf, sphere].</font><br>
</blockquote>
</p>
<p>This can be fixed manually or semi-automatically (we can add a regexp
to our data checker if this bug is encountered often enough; but
it is probably a one-of-a-kind error...).</p>
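<p>A regexp of this kind might look like the following Python sketch.
It is an assumption about how such a check could be written, not the
actual COD data checker; the names
<font face="monospace">SUSPECT_VALUE</font> and
<font face="monospace">suggest_fix</font> are hypothetical, and the
enumeration list is copied from the validator message above.</p>
<pre>
```python
import re

# A value that starts with ';' and optionally ends with a stray quote,
# e.g. ;multi-scan' or ;Nonius
SUSPECT_VALUE = re.compile(r"^;([^'\"\s]+)['\"]?$")

ABSORPT_CORRECTION_TYPES = [
    "analytical", "cylinder", "empirical", "gaussian", "integration",
    "multi-scan", "none", "numerical", "psi-scan", "refdelf", "sphere",
]


def suggest_fix(value):
    """Return the value without the stray delimiters, or None if it looks fine."""
    match = SUSPECT_VALUE.match(value)
    return match.group(1) if match else None


fixed = suggest_fix(";multi-scan'")
print(fixed, fixed in ABSORPT_CORRECTION_TYPES)
```
</pre>
<p>A suggestion that lands in the enumeration list is a good candidate
for a semi-automatic fix; anything else still needs a manual look.</p>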
<p>Regards,<br>
Saulius<br>
</p>
<p>Refs.:</p>
<p>[1] IUCr. CIF v1.1 File Syntax. URL: <a
class="moz-txt-link-freetext"
href="https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#gram">https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#gram</a>
[accessed 2022-11-18T18:23+02:00].</p>
<pre class="moz-signature" cols="72">--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366
</pre>
</body>
</html>