[Cod-bugs] cif corrections

Saulius Gražulis grazulis at ibt.lt
Fri Nov 18 19:23:42 EET 2022


Dear William,

thank you very much for the answer and for the COD processing logs! They 
are indeed very useful for us.

Below, I give some comments on the kinds of issues your software has 
found. Some are hard to fix and reflect the real situation in the 
crystal, or the choice and opinion of the structure author. Some, 
however, are serious problems which we missed and which we will have to 
investigate.

Details below:

On 2022-11-17 18:13, William Lenthe wrote:
> Please find a list of the errors generated by my parsing of the database. I've manually removed a few types of errors that are due to limitations in my code. For example I don't handle modulated structures (I count 285 in the database)
I found 287 in the recent revision of the COD, which is probably the 
same as yours with just some new structures. Confirmed.
> or support cifs without a space group (15), lattice parameters (10), and at least 1 atom (there are so many of these I didn't count). I've also removed 286 instances of an error message like:
>
> 1501515.cif: _atom_type_symbol 'Ti4+' not found in _atom_site_label loop

I get similar counts; confirmed. The _atom_type_symbol values need to be 
fixed systematically, somehow, but there are 115772 such messages, which 
is too many to go through manually.

The '_atom_site_type_symbol' value Ti5+ seems to be a mismatch with the 
'_atom_type_symbol' value 'Ti4+'; this is probably a mistake but we need 
to check the original paper before correcting.

> Since it looks like the database you send already catches that type. I left in the case sensitive versions since they should be easy fixes. Finally also have a list of ~35k warnings that are almost all one of these types:
>
> 1006173.cif: clamped site O2 occupancy from 1.005(5) to 1.
Here, the occupancies seem to be refined. The occupancies are both 
within error margins of 1.0. The occupancies of oxygens O1 and O2 add up 
to 2.0, so clamping one but not adjusting the other will give slightly 
incorrect total oxygen count in the structure. Whether this is an issue 
will depend on your application, of course... We'll leave these values 
as they are in the COD since this is how authors reported the structure; 
if they refined the occupancies we need to leave an indication that this 
was done.
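
As an illustration of the "within error margins" test (a minimal Python
sketch, not COD code; the names parse_su and within_su are made up here,
and exponent notation is deliberately not handled):

```python
import re

# Matches CIF-style numbers such as "1.005(5)"; the digits in parentheses
# are the standard uncertainty in units of the last printed decimal place.
_NUM_SU = re.compile(r"^([+-]?\d*\.?\d+)(?:\((\d+)\))?$")

def parse_su(text):
    """Return (value, su); su is 0.0 when no uncertainty is given."""
    m = _NUM_SU.match(text.strip())
    if m is None:
        raise ValueError(f"not a plain CIF number: {text!r}")
    mantissa, su_digits = m.groups()
    value = float(mantissa)
    if su_digits is None:
        return value, 0.0
    decimals = len(mantissa.split(".")[1]) if "." in mantissa else 0
    return value, int(su_digits) * 10.0 ** (-decimals)

def within_su(text, target, n_sigma=1.0):
    """True if the value agrees with target within n_sigma uncertainties."""
    value, su = parse_su(text)
    # The tiny constant absorbs floating-point rounding at the boundary.
    return abs(value - target) <= n_sigma * su + 1e-12
```

With this, within_su("1.005(5)", 1.0) holds, i.e. the reported occupancy
is statistically indistinguishable from full occupancy.
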
> 1000495.cif: merging equivalent positions to 0.365000 0.365000 0.375000 with total occupancy 0.25 from sites labeled {Cs1, Cs2}.
Specifying multiple atoms at exactly the same site seems to be an 
accepted way to represent occupational disorder. I would say this is a 
feature, not a bug ;)
> 1008070.cif: space group is triclinic but lattice constants are orthorhombic.

Monoclinic space groups can have (nearly) any angles, including 90.0 
degree angles. It would of course be strange to see a monoclinic cell with 
all right angles just by accident, but in this case the abstract of the 
paper [2] says:

"For /x/ > 0.9, these compounds have an orthorhombic symmetry (/O/) if 
the cations are disordered, while the symmetry lowers to monoclinic 
(/Mβ/) if the cations are ordered"

Thus, there is either a re-interpretation of orthorhombic data as 
monoclinic, or a transition between ordered and disordered phases here, 
neither of which changes the cell angles. Thus I would say the angles are 
legit.

> 1100066.cif: corrected trigonal/hexagonal unequal a/b from 9.048(1)/9.047673 to 9.047836.
The cell dimensions are within error margin of each other, so we will 
probably leave them as the authors have reported them. Of course your 
software is absolutely correct to merge the values if that is needed for 
your application.
> 1503454.cif: corrected monoclinic b alpha from 89.990(6) to 90. corrected monoclinic b gamma from 89.995(6) to 90.
Again, angles are within the specified error margins, and probably were 
refined (not fixed), so we leave them as they are in the COD.
>
> The file is a ~11mb so I'll send it using a file service (hightail) instead of as an attachment

Thanks, I have stored it in our private repository, and we will consult 
the file when we have an automatic procedure to fix some of the issues, or 
when we look for the issues that can be fixed manually...

This is the catalogue of messages that I have extracted:

> saulius at tasmanijos-velnias 2022-11-17/ $ awk -F: '{print 
> substr($2,2,12)}' cod_wrn.txt | sort | uniq -c | sort -nr -k1,1 | cat -n
>      1      33414 merging equi
>      2        967 clamped site
>      3        277 _atom_site_a
>      4        224 space group
>      5        219 no space gro
>      6         86 corrected tr
>      7         84 corrected mo
>      8         74 corrected te
>      9         51 corrected or
>     10         22 corrected he
>     11         21 corrected cu
>     12          1 corrected rh
Below, I analyse all unique messages from 'cod_err_flt.txt':

> saulius at tasmanijos-velnias 2022-11-17/ $ awk -F: '{print 
> substr($2,2,20)}' cod_err_flt.txt | sort | uniq -c | sort -nr -k1,1
>     124 _atom_site_aniso_lab
>      51 failed to unambiguou
>      30 cif block has confli
>      17 failed to parse spac
>       5 cif block contains l
>       3 cif block tag '_chem
>       1 monoclinic groups mu
>       1 line 86 isn't commen
>       1 line 55212 isn't com
>       1 line 490 isn't comme
>       1 line 132 isn't comme
>       1 line 128 isn't comme
>       1 cif block loop has m
>       1 cif block has loop r

The '_atom_site_aniso_lab' message is a genuine validation warning; we'll 
triage and probably fix in due time, if it is possible to fix at all.

The 'failed to unambiguously determine space group setting' message is 
correct; there seem to be neither symmetry operators nor a Hall symbol in 
these files. International Tables would imply a default setting, but this 
might be dangerous to assume. Probably we cannot fix this unless 
authors confirm the setting they used, or the setting is recorded in the 
paper.

The 'failed to parse space group from string' message is correct; these 
are either incorrectly recorded space groups from the original CIFs 
(which we cannot fix, probably), or modulated structures which have 
a superspacegroup name instead of the space group name; these we can 
probably fix some day when we have superspacegroup-to-spacegroup 
reduction code integrated into the COD pipeline.

The following finding is for me quite worrying:

> saulius at tasmanijos-velnias 2022-11-17/ $ grep 'cif block has 
> conflicted hall symbol' cod_err_flt.txt  | head -5
> 1010928.cif: cif block has conflicted hall symbol (-P 3* 2n) and space 
> group operators (recovered p_3*_2_-1n)
> 1010956.cif: cif block has conflicted hall symbol (-P 2n 2a) and space 
> group operators (recovered p_2bc_2ac_-1ac)
> 1010962.cif: cif block has conflicted hall symbol (-P 3* 2n) and space 
> group operators (recovered p_3*_2_-1n)
> 1011149.cif: cif block has conflicted hall symbol (-P 2n 2a) and space 
> group operators (recovered p_2bc_2a_-1a)
> 2002944.cif: cif block has conflicted hall symbol (-P 4 2ab) and space 
> group operators (recovered p_4_2ab)

Indeed, the Hall symbols and the symmetry operators in the structures do 
not match (30 cases). We'll have to look at the original publications to 
find out why this is so. We'll add the code to check symop-Hall symbol 
correspondence to our COD check routines. Thanks for pointing this out!
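
A rough idea of what such a check could compare, as a Python sketch only:
it assumes that the operator list implied by the Hall symbol has already
been generated by some space-group library (that step is not shown, and
the names parse_symop and same_operators are made up here). Both lists
are normalised to rotation rows plus translations modulo 1 and compared
as sets:

```python
import re
from fractions import Fraction

def parse_symop(op):
    """Turn e.g. 'x-y,-x+1/2,z' into (rotation rows, translations mod 1)."""
    rows, shifts = [], []
    for comp in op.lower().replace(" ", "").split(","):
        row = [0, 0, 0]
        shift = Fraction(0)
        # Split a component such as '-x+1/2' into signed terms.
        for term in re.findall(r"[+-]?[^+-]+", comp):
            sign = -1 if term.startswith("-") else 1
            body = term.lstrip("+-")
            if body in ("x", "y", "z"):
                row["xyz".index(body)] += sign
            else:
                shift += sign * Fraction(body)
        rows.append(tuple(row))
        shifts.append(shift % 1)  # -1/2 and 1/2 are the same translation
    return tuple(rows), tuple(shifts)

def same_operators(ops_a, ops_b):
    """True if both lists describe the same set of operators."""
    return {parse_symop(o) for o in ops_a} == {parse_symop(o) for o in ops_b}
```

The comparison ignores operator order, letter case, spacing and the
choice between equivalent translations such as y+1/2 and y-1/2; it does
not try to recognise the same group given in a different setting.
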

> saulius at tasmanijos-velnias 2022-11-17/ $ grep ': line' cod_err_flt.txt
> 4029286.cif: line 490 isn't comment or part of loop row of cif but 
> doesn't have _
> 7223602.cif: line 86 isn't comment or part of loop row of cif but 
> doesn't have _
> 7228312.cif: line 128 isn't comment or part of loop row of cif but 
> doesn't have _
> 7238658.cif: line 132 isn't comment or part of loop row of cif but 
> doesn't have _
> 7705257.cif: line 55212 isn't comment or part of loop row of cif but 
> doesn't have _

It is not quite clear what this error message is saying but yes, all 
these cases are not comments and not parts of loops; they are parts of 
multi-line text fields delimited by ';' tokens and can contain arbitrary 
text (well, nearly arbitrary). Could it be that your CIF parser misses 
the beginning of a text field?

Of these files, 7705257.cif contains a garbled HKL Fobs reflection 
list; the rest seem OK.
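
For illustration, here is a minimal Python sketch of how a line-based
reader can keep track of such text fields. It is a simplification of the
CIF 1.1 grammar [1], not the COD parser, and the name split_text_fields
is made up. Only a ';' in the first column opens or closes a field:

```python
def split_text_fields(lines):
    """Yield ('text', body) for ';'-delimited fields, ('line', s) otherwise."""
    in_field = False
    buf = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(";"):
            if in_field:
                # A ';' in column 1 closes the field that is open.
                yield ("text", "\n".join(buf))
                in_field, buf = False, []
                rest = line[1:]
                if rest.strip():
                    yield ("line", rest)  # tokens after the closing ';'
            else:
                # A ';' in column 1 opens a field; the rest of this line
                # already belongs to the text.
                in_field = True
                buf = [line[1:]]
        elif in_field:
            buf.append(line)  # arbitrary text, no '_' or '#' required
        else:
            yield ("line", line)
    if in_field:
        raise ValueError("text field is not terminated")
```

Lines inside a field are never examined for tags, comments or loop
values, so free text without a leading '_' is not an error there.
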

> saulius at tasmanijos-velnias 2022-11-17/ $ grep 'cif block contains l' 
> cod_err_flt.txt
> 4002451.cif: cif block contains loop_ not followed by any tags at line 59
> 4130765.cif: cif block contains loop_ not followed by any tags at line 943
> 7035327.cif: cif block contains loop_ not followed by any tags at line 
> 6035
> 7035331.cif: cif block contains loop_ not followed by any tags at line 
> 6694
> 7035332.cif: cif block contains loop_ not followed by any tags at line 
> 6802

These files (as all COD files) are syntactically OK and tags do follow 
the 'loop_' token. Could it be that your parser fails to discard spaces at 
the beginning of the line before the tags?

> saulius at tasmanijos-velnias 2022-11-17/ $ grep "cif block tag '_chem" 
> cod_err_flt.txt
> 4034776.cif: cif block tag '_chemical_name_systematic' followed by new 
> line and quoted string at 36 but quoted string doesn't close or fill 
> entire line
> 7201872.cif: cif block tag '_chemical_name_common' followed by new 
> line and quoted string at 41 but quoted string doesn't close or fill 
> entire line
> 7233594.cif: cif block tag '_chemical_name_systematic' followed by new 
> line and quoted string at 37 but quoted string doesn't close or fill 
> entire line

The situation is funny with these files. Syntactically, they are correct 
– the CIF syntax [1] permits any <NonBlankChar> as a trailing character 
of a <UnquotedString>, including a single quote ("'")! So your parser 
misleads us: there is no quoted string in these cases, but an /unquoted/ 
string instead that is terminated with a quote (which is a part of the 
value).

However, chemical names in all these files seem to have a superfluous 
trailing quote that should not be included in the name. We should curate 
these entries according to the chemical names given in the respective 
papers. Our CIF checker (cif_cod_check) could probably issue a warning 
when such strange chemical names are encountered...
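
The syntactic point and the possible warning can be sketched in Python as
follows. This is an illustration only: it looks at a single token without
embedded whitespace, the function names are made up, and it does not
describe what cif_cod_check currently does:

```python
def classify_value(token):
    """Classify one whitespace-free CIF 1.1 value token."""
    if token[:1] in ("'", '"'):
        # Only a token that STARTS with a quote is a quoted string, and it
        # must then also end with the same quote character.
        if len(token) >= 2 and token.endswith(token[0]):
            return "quoted"
        return "unterminated-quote"
    # Anything else is an unquoted string, whatever its last character is.
    return "unquoted"

def has_stray_trailing_quote(token):
    """True for a legal unquoted string that ends in a quote character."""
    return classify_value(token) == "unquoted" and token[-1:] in ("'", '"')
```

A made-up value such as abc' is therefore a legal unquoted string, while
has_stray_trailing_quote("abc'") flags it as a candidate for curation.
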

> saulius at tasmanijos-velnias 2022-11-17/ $ grep 'cif block loop has m' 
> cod_err_flt.txt
> 4301644.cif: cif block loop has multi line delimeter token mid line at 
> line 134

Again, this is OK syntactically but does not represent the intended 
data. CIF syntax is weird... Another check needed for 'cif_cod_check'?

I have fixed the entry in the COD :).

> saulius at tasmanijos-velnias 2022-11-17/ $ grep 'cif block has loop r' 
> cod_err_flt.txt
> 4342694.cif: cif block has loop row with 4 columns at line 153 but 
> loop has 2 column headers

This is a false alarm; the loop is perfectly OK:

> saulius at tasmanijos-velnias 2022-11-17/ $ sed -n '152,153p' 
> $(codid2file 4342694)
> _space_group_symop_operation_xyz
> 1 x,y,z 2 -x,-y,-z

Typing two data packets ('loop rows') in one physical line is perfectly 
OK in CIF.
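
In other words, loop values form one stream of tokens that is cut into
packets by the number of headers, regardless of line breaks. A minimal
Python sketch (the name loop_rows is made up, and plain whitespace
splitting ignores quoted values and text fields):

```python
def loop_rows(headers, value_lines):
    """Group all value tokens into packets of len(headers) values each."""
    # Line breaks carry no meaning here: flatten everything first.
    tokens = [tok for line in value_lines for tok in line.split()]
    n = len(headers)
    if n == 0 or len(tokens) % n:
        raise ValueError("number of values is not a multiple of the headers")
    return [tokens[i:i + n] for i in range(0, len(tokens), n)]
```

For the line shown above and two headers (the name of the first header is
not shown in the sed output; '_space_group_symop_id' is assumed here),
this gives the packets ['1', 'x,y,z'] and ['2', '-x,-y,-z'].
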

Regards,
Saulius

Refs.:

[1] IUCr. CIF v1.1 File Syntax. URL: 
https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#gram 
[accessed 2022-11-18T18:23+02:00].

[2] Muller, J.; Joubert, J. C. & Marezio, M.
Etude des phases du système FeVO4–VO2, obtenues par synthèse 
hydrothermale à 70 kbar et 1000 °C
Journal of Solid State Chemistry, Elsevier BV, 1976, 18, 357-362, 
DOI: https://doi.org/10.1016/0022-4596(76)90118-3

Sincerely yours,
Saulius

-- 
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366
