[Cod-bugs] RE: COD conversion with HighScore

Thomas Degen thomas.degen at panalytical.com
Mon Jan 6 22:37:44 EET 2020


Dear Saulius,

Sorry for not yet replying to your last email; I was swamped with other work today.
Wow, that information is already very helpful: we can use it to improve our conversion by looking for some of these cases and applying appropriate fixes.
Let me answer in more detail later; I need to talk to colleagues first.

Thanks and best regards,

Thomas

-----Original Message-----
From: Saulius Gražulis <grazulis at ibt.lt>
Sent: Monday, January 6, 2020 9:05 PM
To: Thomas Degen <thomas.degen at panalytical.com>; Thomas Dortmann <thomas.dortmann at panalytical.com>; cod-bugs at ibt.lt
Cc: Thomas Dortmann <thomas at tdsonline.nl>
Subject: Re: [Cod-bugs] COD conversion with HighScore

Dear Thomas & Thomas,

On 2020-01-03 22:00, Thomas Degen wrote:
> We are indeed processing "_atom_site_type_symbol", but this
> information is missing for many COD entries. It would be great if the
> chemical element (type symbol) would be unambiguously supplied for
> each Atom, we would very much appreciate this.

I have made a survey of structures that have "uncanonical" atom names (the data list files, *.lst, and logs with the Unix commands that were used to generate them are on my server:
http://saulius-grazulis.lt/~saulius/COD-uncanonical-atoms/).

There are currently 8360 structures in the COD with atom names that do not match the periodic table (wc -l COD-structure-with-uncanonical-atoms-counts.lst). Fixing that many seems doable with an automatic script.

The most frequent "strange" atom names are:

> saulius@varanas COD-uncanonical-atoms/ $ head -n 21 COD-uncanonical-atom-counts.lst | grep -v '\?'
>    5018 h
>    4653 c
>    4264 Wat
>    2775 MgM
>    2341 FeM
>    2169 SiT
>    2103 AlT
>    1602 AlM
>    1515 n
>    1095 MnM
>     828 TiM
>     825 CaM
>     758 MgT
>     686 CL
>     649 NaA
>     555 o
>     550 FeT
>     539 NaM
>     520 OW
>     455 KA

"Wat" is most probably water, and we can fix that. We'll have to guess that 'h' is hydrogen, 'c' is carbon and 'CL' is chlorine. If we make such guesses, we need to check only selected structures to confirm them.

AlM is probably aluminium, and other metals are likewise marked with the site label M; we can correct this.
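The guesses above could be sketched roughly as follows. This is only an illustration, not an existing COD tool: the function name guess_type_symbol, the abbreviated element table, the closed set of site suffixes (M, T, A) and the mapping of "Wat" to oxygen are all assumptions made for the example.

```python
import re

# Abbreviated element table for this sketch; a real script would list all symbols.
ELEMENTS = {"H", "C", "N", "O", "Na", "Mg", "Al", "Si", "Cl", "K", "Ca",
            "Ti", "Mn", "Fe"}

# Site suffixes seen in the survey (MgM, SiT, NaA); treating them as a
# closed set is an assumption.
SITE_SUFFIXES = ("M", "T", "A")

# Hypothetical special cases; mapping "Wat" to oxygen is a guess.
SPECIAL = {"Wat": "O"}


def guess_type_symbol(label):
    """Return a guessed element symbol for an atom label, or None."""
    # Keep only the leading letters, so 'AlM1' is treated like 'AlM'.
    name = re.match(r"[A-Za-z]*", label).group(0)
    if name in SPECIAL:
        return SPECIAL[name]
    if name in ELEMENTS:
        return name
    # Case-only problems: 'h' becomes 'H', 'CL' becomes 'Cl'.
    fixed = name.capitalize()
    if fixed in ELEMENTS:
        return fixed
    # Element symbol followed by one site letter: 'AlM' becomes 'Al'.
    for suffix in SITE_SUFFIXES:
        if name.endswith(suffix) and name[:-1] in ELEMENTS:
            return name[:-1]
    return None  # leave for manual inspection
```

Labels that return None (for example 'OW' or 'CuZn') would stay on a list for checking by hand. The case fix is tried before the suffix rule, so a label such as 'NA' would be read as sodium rather than nitrogen on site A; that ordering is a design choice that would need checking against real entries.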

More tricky examples are rare: "SIII" might be "Si II" (the second silicon), or "S III" (the third sulphur). But there are only two of these in the whole COD, so we can check them by hand.
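To illustrate why such a label cannot be resolved automatically, here is a small sketch that lists every reading of a label as "element symbol + Roman numeral". The function name candidate_elements and the tiny element table are assumptions for the example only.

```python
import re

# Tiny element table, enough for the example; a real script would list all symbols.
ELEMENTS = {"S", "Si", "Fe", "Al"}

# Roman numerals I to X.
ROMAN = re.compile(r"I{1,3}|IV|VI{0,3}|IX|X")


def candidate_elements(label):
    """List the element symbols that the label could start with."""
    found = []
    for n in (1, 2):  # element symbols have one or two letters
        prefix, rest = label[:n], label[n:]
        symbol = prefix.capitalize()
        if len(prefix) == n and symbol in ELEMENTS and ROMAN.fullmatch(rest):
            found.append(symbol)
    return found


print(candidate_elements("SIII"))  # ['S', 'Si']
print(candidate_elements("SiII"))  # ['Si']
```

Any label with more than one candidate would go to the manual-check list.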

For all these atoms we would add _atom_site_type_symbol, and, if an existing _atom_site_type_symbol is converted to a different case, we will fix the _atom_type_symbol accordingly.

The trickiest part is 'NaAm', 'BiMe' or 'CuZn'. Some are a single metal on different sites; we can figure out which are which by comparing coordinates and occupancies. But the CuZn site seems to be a mixed Cu and Zn atom site, and no relative occupancies are given, so we cannot resolve it. We'll have to leave this particular CIF as it is (unless we can find more data in the paper).
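One possible way to start the coordinate and occupancy comparison is sketched below. It assumes the atom sites have already been read into (label, x, y, z, occupancy) tuples; the function name shared_positions, the tolerance and the sample values are invented for illustration and are not taken from any COD entry. Symmetry-equivalent positions are not considered.

```python
from collections import defaultdict


def shared_positions(sites, tol=1e-3):
    """Group (label, x, y, z, occupancy) tuples that share one position.

    Coordinates are binned with a crude rounding step `tol`, so values
    that fall on either side of a bin edge may be missed.
    """
    groups = defaultdict(list)
    for label, x, y, z, occ in sites:
        key = tuple(round(c / tol) for c in (x, y, z))
        groups[key].append((label, occ))
    # Only positions occupied by more than one label are of interest.
    return [members for members in groups.values() if len(members) > 1]


# Invented sample data: two labels on one position, one label alone.
sites = [
    ("NaA", 0.0, 0.0, 0.0, 0.6),
    ("KA", 0.0, 0.0, 0.0, 0.4),
    ("CuZn", 0.25, 0.25, 0.25, 1.0),
]
for members in shared_positions(sites):
    total = sum(occ for _, occ in members)
    print(members, "total occupancy:", total)
```

A label such as 'CuZn' that sits alone on its position, with no partner and no separate occupancies, is not reported here and would remain unresolved, as described above.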

Would such fixes be helpful to you, Thomas? If yes, we will think about how to implement them. The best would be to get a good student to do it as a practicum (under the COD team's supervision :).

Regards,
Saulius

--
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
mobile: (+370-684)-49802, (+370-614)-36366

--
This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
