[Cod-bugs] COD conversion with HighScore

Saulius Gražulis grazulis at ibt.lt
Sat Jan 11 13:19:01 EET 2020


Dear Thomas,

On 2020-01-08 18:22, Thomas Dortmann wrote:
> Thanks a lot for your letters and explanations. I'm sorry that it
> took a while to discuss your input and write this answer.

No problem, your answer is really fast!

> One problem is the CIF syntax, which allows for a lot of freedom (=
> better readability), but must not be unique, when interpreted by
> software. Unfortunately this will not change, and for sure not for
> already existing CIFs.

Indeed, CIF is harder to parse than line-based Unix tool formats. That's
why I would recommend using a full-featured CIF parser. We actually have
a quite decent F/LOSS parser in our cod-tools that works with C/C++,
Perl and Python ;) IMHO, some complexity is unavoidable if we want to
parse unambiguously the data produced by different people at different
times. The format needs to be specified mathematically (CIF is);
simpler formats like CSV actually cause more problems in a longs...

Two notes on atomic coordinates:

In the list you have sent (COD/COD_Conv_Warnings.xlsx), a number of
entries contain warning 'No element found for atom:,"""?"""'

As you probably know, technically this is not an error in the COD or
in the CIF, since a lone question mark, '?' (without the quotes), is
permissible as a token for any value (including coordinates or Uij!).

COD currently stores a small number of structures for which we do not
have the coordinates; we cannot get these coordinates, mostly because
the structures are behind paywalls. We prefer to keep these records with
the '?' coordinates so that we know these structures were published and
can search for them if the opportunity arises. Structures without
coordinates are easy to filter.
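
As a minimal sketch of such a filter (the function name and the
dict-based input are assumptions made for illustration, not part of
cod-tools), assuming a CIF parser has already returned the atom site
loop as raw strings:

```python
def has_known_coordinates(atom_sites):
    """Return True if every atom site has all three fractional coordinates.

    `atom_sites` is assumed to be a list of dicts keyed by CIF data names,
    with the values kept as the raw strings found in the CIF.
    """
    coordinate_tags = (
        "_atom_site_fract_x",
        "_atom_site_fract_y",
        "_atom_site_fract_z",
    )
    # '?' (unknown) and '.' (inapplicable) are the CIF placeholder values.
    placeholders = {"?", "."}
    if not atom_sites:
        return False
    for site in atom_sites:
        for tag in coordinate_tags:
            # A missing tag is treated like an unknown value.
            if site.get(tag, "?") in placeholders:
                return False
    return True
```

Entries for which this returns False can be skipped before any peak
list is calculated.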

Another important warning is 'No element found for atom:,"""."""'.

Now, these dots are there for a good reason: most such entries were
found to be fraudulent at some stage, and were retracted after a careful
investigation! The details are specified in each such CIF file from the
COD. You should filter away these structures and not use them. They are
also marked with '_cod_error_flag retracted' and with the corresponding
status and flag columns in the COD SQL DB.
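
A rough sketch of such a check is given below. It is only a line-based
scan, with hypothetical function names, and assumes the flag and its
value sit on one line; a full CIF parser is the more robust way to read
the value.

```python
def cod_error_flag(cif_text):
    """Return the value of _cod_error_flag found in the CIF text, or None.

    Simplification: only a tag and value on the same line are recognised.
    """
    for line in cif_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "_cod_error_flag":
            # Drop optional CIF quoting around the value.
            return fields[1].strip("'\"")
    return None


def is_retracted(cif_text):
    """Return True if the entry is marked as retracted."""
    return cod_error_flag(cif_text) == "retracted"
```

A conversion run could call is_retracted() on each file and skip the
entry when it returns True.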

We cannot simply remove these entries from the COD, since robots or
people would deposit them again and, even worse, the retraction
notice might go unnoticed. They are also a good reminder of what
happened, and a good indicator of the publication status (if you
download the original CIFs from the publisher's site, they are not
marked in any way; a retraction note is for the paper and is posted as
"correction").

> Converting the CIFs from the COD (to generate peak lists and profiles
> for search-match), we experience the following problems:
> 
> 1. atom_names are not translated correctly into atom_symbols: We
> already correct most from your list of 33900 entries, only the "Wat"
> name was missing and got translated into "At". This is fixed now, but
> we would really prefer to find correct atom_symbols in all CIFs.
> 
> Still there is a number of CIFs where the atom_names are not
> interpreted: 1235 warnings (Btw.: "n" could be either nitrogen or a
> multiplier, we are not sure)

We can offer the following additional data curation mark-up:

- we could automatically convert to mixed case (like "Ca"), on the COD
side, all upper-case and lower-case atom designators that *look*
strictly like atom names and are unique after the case conversion. We
would add a column _atom_site_type_symbol if it does not exist; if it
exists, we would adjust the _atom_type_symbol accordingly. Thus, "RU"
would be converted to "Ru", "n" and "cl" would become "N" and "Cl"; I am
not sure how to interpret CS (as "Cs" or as "C" and "S"; the latter case
is rare). We would inspect some, but most probably not all, changed
entries, comparing them to the original paper.

This would help you to determine atom types in your programs, but *may*
in some rare cases introduce incorrect interpretations (that will
eventually be corrected, of course). If you have your own heuristics for
atom type determination, this *may* break the heuristics in some (rare?)
cases.
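
To make the proposed rule concrete, here is a small sketch of such a
case conversion. It is an illustration under assumptions, not the code
the COD would run: the function name is hypothetical, and "unique" is
read here as "a two-letter designator must not also split into two
valid one-letter symbols", so ambiguous names are left for manual
inspection.

```python
ELEMENTS = frozenset("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co
Ni Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb
Te I Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re
Os Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es
Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())


def normalise_symbol(designator):
    """Return a mixed-case element symbol, or None if unclear or ambiguous."""
    # Only purely alphabetic names of one or two letters are considered.
    if not designator.isalpha() or len(designator) > 2:
        return None
    candidate = designator.capitalize()
    if candidate not in ELEMENTS:
        return None
    # A name already in mixed case is taken at face value.
    if designator == candidate:
        return candidate
    # "CS" or "cs" could mean "Cs" or "C" plus "S": leave it undecided.
    if len(designator) == 2:
        first, second = designator.upper()
        if first in ELEMENTS and second in ELEMENTS:
            return None
    return candidate
```

With this reading, "RU", "n" and "cl" come out as "Ru", "N" and "Cl",
while "CS" and "Wat" are returned as None and stay untouched.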

Please let me know if this would be beneficial for you, and if the risk
of introducing temporary errors is acceptable. If you say yes, we do the
conversion and commit. If you say no, we do the conversion slowly, in
the background, converting and checking each structure one-by-one.

So, for example, I fixed the atom types in COD entry 4062572 from the
list you have sent me, by manually editing the supplement file and
redepositing it from the source (along with the other structures from
the same paper).

> 2. Anisotropic displacement parameters are way too big; often
> isotropic values don't match anisotropic values in these cases: We
> convert from Us or Betas to Bs; the warning identifies one or more
> Bansio values >= 10. This results in wrong phase quantitations with
> the Rietveld method. (We also recalculate Biso from the Banisos, when
> Biso is missing) This is the biggest group of CIFs generating
> warnings during the conversion: 123.717 warnings

This is an interesting issue, I'll write a separate letter on this.

> 3. invalid space group: 454 CIFs

I think at the moment most space groups that could be corrected are
corrected. Please note that some CIFs use non-standard settings; the best
thing is IMHO to parse symmetry operators when they exist and use them
instead of interpreting the space-group symbols.
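
As an illustration of what parsing the operators involves, the sketch
below turns one "x,y,z"-style string (as stored under
_symmetry_equiv_pos_as_xyz or _space_group_symop_operation_xyz) into a
3x3 matrix and a translation vector. The function name is hypothetical
and only simple terms such as "-x", "1/2", "0.5" or "x-y" are handled.

```python
import re
from fractions import Fraction

# One term: optional sign, optional number (decimal or fraction), optional axis.
_TERM = re.compile(r"([+-]?)(\d*\.\d+|\d+(?:/\d+)?)?\*?([xyz]?)")


def parse_symop(symop):
    """Parse e.g. '-x+1/2, y, -z+1/2' into (matrix rows, translation)."""
    axes = "xyz"
    components = symop.lower().replace(" ", "").split(",")
    if len(components) != 3:
        raise ValueError("expected three components: %r" % symop)
    matrix, translation = [], []
    for component in components:
        row = [Fraction(0)] * 3
        shift = Fraction(0)
        pos = 0
        while pos < len(component):
            match = _TERM.match(component, pos)
            if match is None or match.end() == pos:
                raise ValueError("cannot parse %r" % component)
            sign, number, axis = match.groups()
            value = Fraction(number) if number else Fraction(1)
            if sign == "-":
                value = -value
            if axis:
                row[axes.index(axis)] += value  # coefficient of x, y or z
            elif number:
                shift += value                  # pure translation term
            else:
                raise ValueError("dangling sign in %r" % component)
            pos = match.end()
        matrix.append(row)
        translation.append(shift)
    return matrix, translation
```

For example, parse_symop("-x+1/2, y, -z+1/2") gives the rows (-1, 0, 0),
(0, 1, 0), (0, 0, -1) and the translation (1/2, 0, 1/2), independent of
how the space-group symbol is written.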

Some structures from your list are also modulated structures with
super-spacegroups.

We'll see how many entries from the list can be corrected, but probably
not all.

> You find a list of all CIFs throwing warnings during conversion in
> the attached Excel sheet, ordered by the type of warning and the COD
> ID number. I hope this will help to correct errors in the COD and to
> make it better applicable through time. Please let me know when you
> need more information, or want additional checks during our
> conversions to find out more details.

Thanks, I went through the list and will send you some remarks in the
next e-mail; the remarks on '.' and '?' coordinates are above.

> Discussing things with my boss it looks like we are not going  to put
> real effort (like paying a student) into this, because this would
> help the competition as well.

I see your point. The list of issues you have sent us is very useful for
us, and we'll fix the things that we can fix.

Sincerely,
Saulius

-- 
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
mobile: (+370-684)-49802, (+370-614)-36366
