[Cod-bugs] COD conversion with HighScore

Saulius Gražulis grazulis at ibt.lt
Mon Jan 6 22:04:55 EET 2020


Dear Thomas & Thomas,

On 2020-01-03 22:00, Thomas Degen wrote:
> We are indeed processing "_atom_site_type_symbol", but this
> information is missing for many COD entries. It would be great if the
> chemical element (type symbol) would be unambiguously supplied for
> each Atom, we would very much appreciate this.

I have made a survey of structures that have "uncanonical" atom names
(the data list files, *.lst, and logs with Unix commands that were used
to generate them are on my server:
http://saulius-grazulis.lt/~saulius/COD-uncanonical-atoms/).

There are 8360 structures currently in the COD with atom names that do
not match the periodic table (wc -l
COD-structure-with-uncanonical-atoms-counts.lst). Fixing that many seems
doable with an automatic script.

The most frequent "strange" atom names are:

> saulius at varanas COD-uncanonical-atoms/ $ head -n 21 COD-uncanonical-atom-counts.lst | grep -v '\?'
>    5018 h
>    4653 c
>    4264 Wat
>    2775 MgM
>    2341 FeM
>    2169 SiT
>    2103 AlT
>    1602 AlM
>    1515 n
>    1095 MnM
>     828 TiM
>     825 CaM
>     758 MgT
>     686 CL
>     649 NaA
>     555 o
>     550 FeT
>     539 NaM
>     520 OW
>     455 KA

The Wat is most probably water, and we can fix that. We'll have to guess
that 'h' is hydrogen, 'c' is carbon and 'CL' is chlorine. If we make
such a guess, we only need to check selected structures to confirm it.

AlM is probably aluminium on an M site, and other metals are marked with
the same M site sign; we can correct this.
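
The guesses above could be scripted roughly as follows. This is only a
sketch, not the COD team's actual script: the function name
guess_element, the abbreviated ELEMENTS set, the list of site suffixes
and the mapping of 'Wat' and 'OW' to oxygen are all assumptions made for
illustration.

```python
# Abbreviated element set for illustration; a real script needs the full periodic table.
ELEMENTS = {"H", "C", "N", "O", "Na", "Mg", "Al", "Si", "S", "Cl",
            "K", "Ca", "Ti", "Mn", "Fe", "Cu", "Zn", "Bi"}

# Assumption: a trailing capital letter marks the crystallographic site (M, T, A).
SITE_SUFFIXES = ("M", "T", "A")

# Assumption: water labels stand for the water oxygen atom.
SPECIAL = {"Wat": "O", "OW": "O"}


def guess_element(label):
    """Return a guessed element symbol for an atom name, or None if unsure."""
    if label in SPECIAL:
        return SPECIAL[label]
    # Case problems: 'h' -> 'H', 'CL' -> 'Cl'.
    canonical = label[:1].upper() + label[1:].lower()
    if canonical in ELEMENTS:
        return canonical
    # Site markers: 'AlM' -> 'Al', 'SiT' -> 'Si', 'KA' -> 'K'.
    if label.endswith(SITE_SUFFIXES) and label[:-1] in ELEMENTS:
        return label[:-1]
    return None  # e.g. 'CuZn' or 'SIII': leave for manual inspection


for name in ("h", "CL", "Wat", "AlM", "KA", "CuZn"):
    print(name, "->", guess_element(name))
```

Names for which the function returns None would go to a list for manual
checking rather than being changed automatically.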

Trickier examples are rare: "SIII" might be "Si II" (the second
silicon), or "S III" (the third sulphur). But there are only two of
these in the whole COD, so we can check them by hand.
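
To flag such names automatically, one could list every way a name
splits into an element symbol plus a Roman numeral and send anything
with more than one reading to manual checking. A minimal sketch; the
function name candidate_readings and the abbreviated ELEMENTS and ROMAN
sets are hypothetical.

```python
# Abbreviated sets for illustration only.
ELEMENTS = {"S", "Si", "Fe", "Al"}
ROMAN = {"I", "II", "III", "IV", "V", "VI"}


def candidate_readings(label):
    """List all (element, numeral) splits of an atom name such as 'SIII'."""
    readings = []
    for n in (1, 2):  # element symbols are one or two letters long
        head, tail = label[:n], label[n:]
        symbol = head[:1].upper() + head[1:].lower()
        if symbol in ELEMENTS and tail.upper() in ROMAN:
            readings.append((symbol, tail.upper()))
    return readings


readings = candidate_readings("SIII")
if len(readings) > 1:
    print("ambiguous:", readings)  # both 'S III' and 'Si II' are possible
```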

For all these atoms we would add _atom_site_type_symbol, and, if the
existing _atom_site_type_symbol is converted to a different case, we
will fix the _atom_type_symbol accordingly.
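
Keeping the two loops consistent could look like the sketch below. It
assumes the _atom_site_type_symbol and _atom_type_symbol values have
already been read into plain Python lists by some CIF parser; the
function names are hypothetical and no particular parser API is implied.

```python
def canonical_case(symbol):
    """'CL' -> 'Cl', 'h' -> 'H'; a trailing charge such as '3+' is left alone."""
    return symbol[:1].upper() + symbol[1:].lower()


def normalise_type_symbols(site_symbols, type_symbols):
    """Return case-normalised copies of both lists plus any site symbols
    that have no matching _atom_type_symbol entry."""
    new_site = [canonical_case(s) for s in site_symbols]
    new_type = []
    for s in type_symbols:
        c = canonical_case(s)
        if c not in new_type:  # 'CL' and 'Cl' collapse into one entry
            new_type.append(c)
    missing = [s for s in dict.fromkeys(new_site) if s not in new_type]
    return new_site, new_type, missing


site, types, missing = normalise_type_symbols(["CL", "h", "Cl"], ["CL", "H"])
print(site, types, missing)
```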

The trickiest cases are names like 'NaAm', 'BiMe' or 'CuZn'. Some are a
single metal on different sites; we can figure out which are which by
comparing coordinates and occupancies. But the CuZn site seems to be a
mixed Cu and Zn atom site, and no relative occupancies are given, so we
cannot resolve it. We'll have to leave this particular CIF as it is
(unless we can find more data in the paper).
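
The comparison of coordinates and occupancies could be sketched as
below: atoms are grouped by fractional position, and any position
holding more than one atom is reported as a possibly mixed site together
with its total occupancy. The tuple layout, the function name
group_by_position and the example values are invented for illustration,
and grouping by rounded coordinates is a simplification (positions that
straddle a rounding boundary, or that are related by symmetry, are not
merged).

```python
from collections import defaultdict


def group_by_position(atoms, ndigits=3):
    """Group (label, x, y, z, occupancy) tuples by rounded fractional coordinates."""
    groups = defaultdict(list)
    for label, x, y, z, occ in atoms:
        key = tuple(round(c % 1.0, ndigits) for c in (x, y, z))
        groups[key].append((label, occ))
    return dict(groups)


# Hypothetical example data, not taken from any COD entry.
atoms = [
    ("NaA", 0.0, 0.0, 0.0, 0.6),
    ("KA", 0.0, 0.0, 0.0, 0.4),
    ("SiT", 0.25, 0.25, 0.25, 1.0),
]

groups = group_by_position(atoms)
mixed = {pos: members for pos, members in groups.items() if len(members) > 1}
for pos, members in mixed.items():
    total = sum(occ for _, occ in members)
    print(pos, [label for label, _ in members], "total occupancy", total)
```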

Would such fixes be helpful to you, Thomas? If yes, we will think about
how to implement them. The best would be to get a good student to do it
as a practicum (under the COD team's supervision :).

Regards,
Saulius

-- 
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
mobile: (+370-684)-49802, (+370-614)-36366



