[Cod-bugs] COD conversion with HighScore
Thomas Dortmann
thomas.dortmann at panalytical.com
Wed Jan 8 18:22:49 EET 2020
Hi Saulius!
Thanks a lot for your letters and explanations. I'm sorry that it took a while to discuss your input and write this answer.
One problem is the CIF syntax, which allows a lot of freedom (= better readability), but as a result is not necessarily unambiguous when interpreted by software. Unfortunately this will not change, and certainly not for already existing CIFs.
Converting the CIFs from the COD (to generate peak lists and profiles for search-match), we experience the following problems:
1. atom_names are not translated correctly into atom_symbols:
We already correct most of the entries from your list of 33900; only the "Wat" name was missing and got translated into "At". This is fixed now, but we would really prefer to find correct atom_symbols in all CIFs.
Still there is a number of CIFs where the atom_names are not interpreted: 1235 warnings
(Btw.: "n" could be either nitrogen or a multiplier, we are not sure)
2. Anisotropic displacement parameters are way too big; often isotropic values don't match anisotropic values in these cases:
We convert from Us or Betas to Bs; the warning identifies one or more Baniso values >= 10. This results in wrong phase quantitations with the Rietveld method. (We also recalculate Biso from the Baniso values when Biso is missing.)
This is the biggest group of CIFs generating warnings during the conversion: 123,717 warnings
3. invalid space group: 454 CIFs
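The check described under point 2 can be sketched as follows. This is only a minimal illustration, not the HighScore implementation: it assumes the displacement parameters are given as U values in Å^2, uses the standard relation B = 8*pi^2*U, and takes the equivalent isotropic value as the mean of the three diagonal terms, which holds only for orthogonal cell axes. The function names and the returned dictionary are hypothetical.

```python
import math

# Threshold taken from the message: warn when any Baniso value is >= 10.
B_WARN_THRESHOLD = 10.0


def u_to_b(u):
    """Convert a displacement parameter U (A^2) to B (A^2): B = 8*pi^2*U."""
    return 8.0 * math.pi ** 2 * u


def check_site(u_aniso, u_iso=None):
    """Convert one atom site and flag suspiciously large values.

    u_aniso is (U11, U22, U33, U12, U13, U23).
    ASSUMPTION: orthogonal axes, so Beq is the mean of B11, B22, B33.
    """
    b_aniso = [u_to_b(u) for u in u_aniso]
    b_eq = sum(b_aniso[:3]) / 3.0
    # Recalculate Biso from the anisotropic values when it is missing.
    b_iso = u_to_b(u_iso) if u_iso is not None else b_eq
    too_big = any(b >= B_WARN_THRESHOLD for b in b_aniso)
    return {"b_aniso": b_aniso, "b_eq": b_eq, "b_iso": b_iso, "too_big": too_big}


if __name__ == "__main__":
    ok = check_site((0.02, 0.02, 0.02, 0.0, 0.0, 0.0))
    bad = check_site((0.2, 0.02, 0.02, 0.0, 0.0, 0.0), u_iso=0.02)
    print(ok["too_big"], bad["too_big"])
```

Comparing `b_iso` with `b_eq` for a site that supplies both would also reveal the cases where isotropic and anisotropic values do not match. For non-orthogonal cells the equivalent isotropic value needs the cell metric, which this sketch leaves out.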
You will find a list of all CIFs throwing warnings during conversion in the attached Excel sheet, ordered by the type of warning and the COD ID number. I hope this will help to correct errors in the COD and make it more widely applicable over time. Please let me know if you need more information, or want additional checks during our conversions to find out more details.
After discussing this with my boss, it looks like we are not going to put real effort (like paying a student) into this, because it would help the competition as well.
best regards,
Thomas Dortmann
-----Original Message-----
From: Saulius Gražulis <grazulis at ibt.lt>
Sent: 06 January 2020 21:05
To: Thomas Degen <thomas.degen at panalytical.com>; Thomas Dortmann <thomas.dortmann at panalytical.com>; cod-bugs at ibt.lt
Cc: Thomas Dortmann <thomas at tdsonline.nl>
Subject: Re: [Cod-bugs] COD conversion with HighScore
Dear Thomas & Thomas,
On 2020-01-03 22:00, Thomas Degen wrote:
> We are indeed processing "_atom_site_type_symbol", but this
> information is missing for many COD entries. It would be great if the
> chemical element (type symbol) would be unambiguously supplied for
> each Atom, we would very much appreciate this.
I have made a survey of structures that have "uncanonical" atom names (the data list files, *.lst, and logs with Unix commands that were used to generate them are on my server:
http://saulius-grazulis.lt/~saulius/COD-uncanonical-atoms/).
There are 8360 structures currently in the COD with atom names that do not match the periodic table (wc -l COD-structure-with-uncanonical-atoms-counts.lst). This seems doable with an automatic script.
The most frequent "strange" atom names are:
> saulius at varanas COD-uncanonical-atoms/ $ head -n 21 COD-uncanonical-atom-counts.lst | grep -v '\?'
> 5018 h
> 4653 c
> 4264 Wat
> 2775 MgM
> 2341 FeM
> 2169 SiT
> 2103 AlT
> 1602 AlM
> 1515 n
> 1095 MnM
> 828 TiM
> 825 CaM
> 758 MgT
> 686 CL
> 649 NaA
> 555 o
> 550 FeT
> 539 NaM
> 520 OW
> 455 KA
The Wat is most probably water, and we can fix that. We'll have to guess that 'h' is hydrogen, 'c' is carbon and 'CL' is chlorine. If we make such a guess, we can check only selected structures to see if this is so.
AlM is probably aluminium, and other metals are also marked with the site M sign – we can correct this.
More tricky examples are rare: "SIII" might be "Si II" (the second silicon), or "S III" (the third sulphur). But there are only two of these in the whole COD, so we can check them by hand.
For all these atoms we would add _atom_site_type_symbol, and, if the existing _atom_site_type_symbol is converted to a different case, we will fix the _atom_type_symbol accordingly.
The most tricky part is 'NaAm', 'BiMe' or 'CuZn'. Some are a single metal on different sites; we can figure out which are which by comparing coordinates and occupancies. But the CuZn site seems to be a mixed Cu and Zn atom site, and no relative occupancies are given – so we cannot resolve them. We'll have to leave this particular CIF as it is (unless we can find more data in the paper).
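The guessing described above could be sketched roughly like this. It is only an illustration under stated assumptions, not an existing COD script: the function name `candidate_symbols` and the `OVERRIDES` table are hypothetical, and mapping "Wat" to oxygen is an assumption about how water would be represented. The function returns every element symbol that the leading letters of a label could stand for, so labels with more than one candidate can be set aside for checking by hand.

```python
import re

# Element symbols of the periodic table (H through Og).
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn
Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs Ba La
Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po
At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh Hs Mt Ds Rg
Cn Nh Fl Mc Lv Ts Og
""".split())

# Hypothetical table of labels that need a human decision.
# ASSUMPTION: a "Wat" site is represented by the water oxygen.
OVERRIDES = {"Wat": "O"}


def candidate_symbols(label):
    """Return the element symbols an atom label could plausibly stand for."""
    if label in OVERRIDES:
        return [OVERRIDES[label]]
    match = re.match(r"[A-Za-z]+", label)
    if not match:
        return []
    letters = match.group(0)
    one = letters[0].upper()
    two = letters[:2].capitalize() if len(letters) >= 2 else None
    found = []
    if two in ELEMENTS:
        found.append(two)
        # A properly cased two-letter symbol ("Fe" in "FeM") is taken as is.
        if letters[0].isupper() and letters[1].islower():
            return found
    if one in ELEMENTS:
        found.append(one)
    return found


if __name__ == "__main__":
    for label in ["h", "c", "n", "AlM", "FeM", "KA", "CL", "SIII", "Wat"]:
        print(label, candidate_symbols(label))
```

With this rule "AlM" and "FeM" resolve to a single symbol, while "CL" and "SIII" come back with two candidates each and would go to the manual list. A label such as "CuZn" still returns only "Cu", so mixed sites cannot be detected this way and would need the comparison of coordinates and occupancies.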
Would such fixes be helpful to you, Thomas? If yes, we will think about how to implement them. The best would be to get a good student to do it as a practicum (under the COD team supervision :).
Regards,
Saulius
--
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
mobile: (+370-684)-49802, (+370-614)-36366
--
This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: COD_Conv_Warnings.xlsx
Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Size: 3940157 bytes
Desc: COD_Conv_Warnings.xlsx
URL: <http://lists.crystallography.net/pipermail/cod-bugs/attachments/20200108/0f817e26/attachment-0001.xlsx>