[Cod-bugs] COD conversion with HighScore

Saulius Gražulis grazulis at ibt.lt
Sat Jan 4 10:53:59 EET 2020


Dear Thomas,

On 2020-01-03 22:00, Thomas Degen wrote:
> We are indeed processing "_atom_site_type_symbol",

This is good news! With this in mind, I hope we can adopt a workable
policy for COD curation. Please see below.

> but this information is missing for many COD entries.

The _atom_site_type_symbol is indeed missing in most of the CIFs supplied
to the COD.

This is probably even a good thing, since it leaves us an unused data
name that we can use for data curation, adding our values without
changing the original data.

> It would be great if the chemical element (type symbol) would be
> unambiguously supplied for each Atom, we would very much appreciate
> this.

At the moment, we follow the IUCr definition of the _atom_site_label and
_atom_site_type_symbol:

https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_type_symbol.html:

"A code to identify the atom species (singular or plural)
occupying this site. /.../ The specification of this code is optional if
component 0 of the _atom_site_label is used for this purpose"

Thus, the _atom_site_type_symbol may be missing, and is indeed missing
in most of the COD entries. In that case, we are supposed to use the
first letters of the _atom_site_label, e.g.:

Fe3+17 is Fe;
C_a_phe_83_a_0 is C (carbon);
O12 is oxygen.
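
As an illustration of the "first letters" rule, here is a minimal Python
sketch. The function name leading_letters is hypothetical, and treating
only ASCII letters as part of the symbol is an assumption, not something
the IUCr dictionary spells out:

```python
import re


def leading_letters(label: str) -> str:
    """Return the run of ASCII letters at the start of a label ('' if none)."""
    match = re.match(r"[A-Za-z]+", label)
    return match.group(0) if match else ""


# The three labels quoted above.
for label in ("Fe3+17", "C_a_phe_83_a_0", "O12"):
    print(label, "->", leading_letters(label))
```

The underscore and the digits stop the match, so "C_a_phe_83_a_0" gives
"C" and "Fe3+17" gives "Fe".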

Now, I would be very reluctant to supply the _atom_site_type_symbol
automatically, since we can make mistakes; take HO12, for example: is it
holmium (Ho) or is it hydroxyl (OH-)? We had a case where Ho was
incorrectly inferred instead of hydroxyl, and I suspect we can have Ho
species spelled in all caps as well.

Thus, the addition of _atom_site_type_symbol *requires* manual
inspection, and we physically cannot do it for every COD entry (we will
soon have half a million :). So I suggest adding
_atom_site_type_symbol *only* when the _atom_site_label is ambiguous or
can be interpreted incorrectly, as spotted by processing software (so
your logs are very important for COD data curation!).
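
One way processing software might spot such labels is sketched below in
Python. The heuristics (a case mismatch against the element symbols, or
an all-caps prefix that also reads as several one-letter elements) and
the name ambiguity_reasons are assumptions made for illustration, not an
existing COD tool; deuterium "D" is deliberately left out of the table:

```python
import re

# The 118 IUPAC element symbols.
ELEMENTS = frozenset("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni
Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I
Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt
Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr
Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())


def ambiguity_reasons(label: str) -> list:
    """Return reasons why a label should be inspected manually ([] if none)."""
    match = re.match(r"[A-Za-z]+", label)
    if not match:
        return ["label does not start with a letter"]
    prefix = match.group(0)
    reasons = []
    if prefix not in ELEMENTS:
        if prefix.capitalize() in ELEMENTS:
            reasons.append(
                f"'{prefix}' matches {prefix.capitalize()} only after changing case")
        else:
            reasons.append(f"'{prefix}' is not an element symbol")
    # An all-caps prefix such as "HO" can also be read as H + O.
    if len(prefix) > 1 and prefix.isupper() and all(c in ELEMENTS for c in prefix):
        reasons.append(f"'{prefix}' can be read as several one-letter elements")
    return reasons


for label in ("O12", "Fe3+17", "HO12", "ho3"):
    print(label, ambiguity_reasons(label))
```

With these heuristics "O12" and "Fe3+17" come back clean, while "HO12"
is flagged twice: once for the case mismatch with Ho and once because it
also reads as H plus O.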

If both _atom_site_label and _atom_site_type_symbol are present, then
the _atom_site_type_symbol should be used:

https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_label.html:

"The _atom_site_type_symbol always takes precedence over an
_atom_site_label in the identification of the atom type."

Thus, if we specify a (correct) _atom_site_type_symbol, it will override
the _atom_site_label definition; but if there is no
_atom_site_type_symbol, then _atom_site_label SHOULD be used for atom
type identification.

We would then go through all COD entries that still do not have
_atom_site_type_symbol, and for those where _atom_site_label is
ambiguous or has a chance of being misinterpreted, we add
_atom_site_type_symbol; the rest we leave untouched (minimal
intervention).

> Only in that case we can generate a correct diffraction pattern from
> the atomic coordinates.

Obviously. So, in the end I would suggest the following algorithm for
determining the atom type, in pseudocode:

IF _atom_site_type_symbol exists, THEN
   Take the leading *letter* characters of _atom_site_type_symbol
   (e.g.: "O2-"->"O", "Ca2+"->"Ca");
   IF the resulting string matches a known IUPAC element symbol, THEN
      Use the resulting string as the atom type name;
   ELSE
      ERROR
   END IF (*inner IF*)
ELSE
   The _atom_site_label MUST exist (else ERROR);
   Take the leading *letter* characters of _atom_site_label
   (e.g.: "O21"->"O", "Ca2+12"->"Ca");
   IF the resulting string matches a known IUPAC element symbol, THEN
      Use the resulting string as the atom type name;
   ELSE
      ERROR
   END IF (*inner IF*)
END IF (*outer IF*)
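
For concreteness, the pseudocode above could be sketched in Python
roughly as follows. The names atom_type and AtomTypeError, the
representation of an atom site as a dict keyed by CIF data names, and
the treatment of the CIF placeholders "?" and "." as "missing" are all
assumptions of this sketch; the match is case-sensitive, as in the
pseudocode:

```python
import re

# The 118 IUPAC element symbols (deuterium "D" is not included).
ELEMENTS = frozenset("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni
Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I
Xe Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt
Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr
Rf Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())

MISSING = (None, "?", ".")  # assumption: CIF placeholders count as absent


class AtomTypeError(ValueError):
    """Raised where the pseudocode says ERROR."""


def atom_type(site: dict) -> str:
    """Determine the element symbol for one atom site row."""
    source = "_atom_site_type_symbol"
    value = site.get(source)
    if value in MISSING:
        # No type symbol: the label MUST exist and is used instead.
        source = "_atom_site_label"
        value = site.get(source)
        if value in MISSING:
            raise AtomTypeError("neither type symbol nor label is present")
    # Take the leading letter characters, e.g. "Ca2+12" -> "Ca".
    match = re.match(r"[A-Za-z]+", value)
    prefix = match.group(0) if match else ""
    if prefix not in ELEMENTS:
        raise AtomTypeError(f"{source} '{value}' gives unknown symbol '{prefix}'")
    return prefix


print(atom_type({"_atom_site_label": "Ca2+12"}))
print(atom_type({"_atom_site_label": "HO12", "_atom_site_type_symbol": "O2-"}))
```

The second call shows the precedence rule: the type symbol "O2-" wins
over the label "HO12". With the label alone, "HO" is not an element
symbol in this capitalisation, so the function raises AtomTypeError
rather than guessing.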

We could also try to correct capitalisation (CA->Ca, ho->Ho) in our
algorithm, but this is probably too risky (again, is "ho" a hydroxyl or
Ho? You never know what people were thinking...).

One note: the IUCr gives examples of _atom_site_type_symbol such as
"Fe3+Ni2+"; this implies to me both Fe and Ni on an occupationally
disordered site. The relative occupancies are not explicitly specified
in such a case, but the _atom_type_symbol loop MUST contain the combined
scattering factors used to refine the species on this site. Hopefully we
can handle such cases as well...
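
A possible way to split such combined symbols is sketched below in
Python. The function name split_species and the assumed grammar (one or
more element symbols, each optionally followed by a charge such as "3+"
or "-") are assumptions of this sketch; other type symbols that do not
fit this pattern are simply rejected, and the element symbols are not
validated here:

```python
import re

# One species: an element-like symbol plus an optional charge ("3+", "2-", "+").
_SPECIES = re.compile(r"([A-Z][a-z]?)(\d*[+-])?")
_WHOLE = re.compile(r"(?:[A-Z][a-z]?(?:\d*[+-])?)+")


def split_species(type_symbol: str) -> list:
    """Split e.g. 'Fe3+Ni2+' into [('Fe', '3+'), ('Ni', '2+')]."""
    if not _WHOLE.fullmatch(type_symbol):
        raise ValueError(f"cannot split type symbol {type_symbol!r}")
    return [(element, charge) for element, charge in _SPECIES.findall(type_symbol)]


print(split_species("Fe3+Ni2+"))
print(split_species("O2-"))
```

A symbol that yields more than one species would then mark the site as
mixed, to be matched against the _atom_type_symbol loop rather than
reduced to a single element.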

It would then be good to check the chemical formula derived from the
atom coordinate entries against the formula provided by the authors.
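
Such a check might look like the Python sketch below. It assumes the
authors' formula is given in the space-separated style of
_chemical_formula_sum (e.g. "C6 H12 O6"), that each site is passed as a
hypothetical (element, occupancy, multiplicity) tuple, and that comparing
element ratios within a tolerance is good enough; the function names and
the 2% tolerance are illustrative, and the site data in the example are
made up:

```python
import math
import re
from collections import Counter


def parse_formula_sum(formula: str) -> dict:
    """Parse a formula such as 'C6 H12 O6' into {'C': 6.0, 'H': 12.0, 'O': 6.0}."""
    counts = Counter()
    for token in formula.split():
        match = re.fullmatch(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", token)
        if not match:
            raise ValueError(f"cannot parse formula token {token!r}")
        counts[match.group(1)] += float(match.group(2) or 1)
    return dict(counts)


def composition_from_sites(sites) -> dict:
    """Sum occupancy * multiplicity per element over the atom sites."""
    counts = Counter()
    for element, occupancy, multiplicity in sites:
        counts[element] += occupancy * multiplicity
    return dict(counts)


def same_ratios(a: dict, b: dict, rel_tol: float = 0.02) -> bool:
    """True if both compositions hold the same elements in the same ratios."""
    if set(a) != set(b):
        return False
    total_a, total_b = sum(a.values()), sum(b.values())
    return all(math.isclose(a[e] / total_a, b[e] / total_b, rel_tol=rel_tol)
               for e in a)


# Made-up example: two sites compared against a formula "Ho2 O3".
sites = [("Ho", 1.0, 4), ("O", 1.0, 6)]
print(same_ratios(composition_from_sites(sites), parse_formula_sum("Ho2 O3")))
```

Comparing ratios avoids needing Z, but entries whose coordinates omit
hydrogen atoms would fail such a strict comparison and would need a more
lenient rule.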

We on the COD side would scan for the entries that produce errors in
this algorithm, merge that list with your logs of COD entries that
produced crazy diffraction patterns (misinterpreting O as Ho should
throw µ and F_000 totally off, shouldn't it?), and then manually add the
_atom_site_type_symbol to those entries that can be unambiguously
corrected from the structure source.

Would such a policy be OK with you? From our side, it looks doable over
time if the number of entries to be corrected is not dramatically large
(i.e. within the limits of thousands of entries).

Regards,
Saulius

PS. I am CC'ing the COD AB list since this concerns COD curation
policies.

-- 
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
mobile: (+370-684)-49802, (+370-614)-36366
