From thomas.dortmann at panalytical.com Fri Jan 3 15:30:08 2020
From: thomas.dortmann at panalytical.com (Thomas Dortmann)
Date: Fri, 3 Jan 2020 13:30:08 +0000
Subject: [Cod-bugs] COD conversion with HighScore
Message-ID:

Hi Saulius,

We (once more) converted the COD latest release from October 2019 with our HighScore software. Before I ask you to prepare a download of the converted database (as in previous years), I have a couple of questions.

We enhanced our tests during the database conversion and are now confronted with at least 2 issues during the conversion:

1. Naming of water (oxygen) positions as in for example COD entry 9015086 as Wat1, Wat2 and so on.

   Questions: Is this a standard way of indicating water positions in the COD?
   Are there other naming conventions for water positions in the COD?

2. We check the values of Biso and Baniso, and we also convert Baniso-values back to Biso values;
   in COD entry 9014636 there are very big Uaniso-values (converted into Baniso values > 10), but small Uiso-values?

   Questions: does the COD apply a sanity check on the supplied B (or U) values, and do you compare anisotropic with isotropic values?

We can easily give you a list of all COD entries which have (converted) B-values > 10, if that helps.

I am still waiting for the original literature of these two examples to exclude any input errors for the B's.

Best regards, and a happy new year 2020 to you!

Thomas Dortmann

This email and any files transmitted with it are confidential and may be legally privileged. Such message is intended solely for the use of the individual or entity to whom they are addressed. Please notify the originator of the message if you are not the intended recipient and destroy all copies of the message. Please note that any use, dissemination, or reproduction is strictly prohibited and may be unlawful.

--
This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean.
From grazulis at ibt.lt Fri Jan 3 21:35:41 2020
From: grazulis at ibt.lt (Saulius Gražulis)
Date: Fri, 3 Jan 2020 21:35:41 +0200
Subject: [Cod-bugs] COD conversion with HighScore
In-Reply-To:
References:
Message-ID:

Dear Thomas & Thomas,

thank you very much for your e-mail, it is great to hear from you again! On this occasion, my best wishes for the New Year 2020 from the COD team!

On 2020-01-03 15:30, Thomas Dortmann wrote:
> We (once more) converted the COD latest release from October 2019 with
> our HighScore software.

Great, we will be happy to host the new file, as agreed before.

Answering your questions:

> 1. Naming of water (oxygen) positions as in for example COD entry
> 9015086 as Wat1, Wat2 and so on.
>
> Questions: Is this a standard way of indicating water positions
> in the COD?
>
> Are there other naming conventions for water positions in the COD?

There are no conventions for water atom names in the COD. When depositing files, we leave the original atom names provided by the author; I believe we should not change these names, so that the data remain traceable back to the original publications. The 9015086 record comes from the AMCSD, and WatN is the convention that the AMCSD uses; however, it is not widespread outside the AMCSD. Thus, other COD entries MAY contain different names for water residue atoms, and you should not rely on the atom name Wat to infer whether an atom belongs to a water.

The COD approach to indicating water positions is the following:

a/ we add _atom_site_type_symbol with the chemical element symbol (according to the Mendeleev periodic table), "O", for the WatN atoms, so that we (and software ;) know this is an oxygen;

b/ we add _atom_site_attached_hydrogens with the value "2" for the WatN sites; this gives the summary formula H2O, indicating water for these sites, and is a correct way to maintain the hydrogen balance without introducing spurious hydrogen sites with unknown coordinates.

BTW, the same rule is applied to ammonium ions, sulphurs, carbons at low resolution, etc., i.e. to any atoms that may have invisible hydrogens attached to them.

In this way, the original authors' atom names and their data are not changed; we just add an additional interpretation to the COD files (and we will check whether this interpretation is consistent with the original paper).

The new table for the entry COD 9015086 would look as follows:

> loop_
> _atom_site_type_symbol
> _atom_site_attached_hydrogens
> _atom_site_label
> _atom_site_fract_x
> _atom_site_fract_y
> _atom_site_fract_z
> _atom_site_occupancy
> _atom_site_U_iso_or_equiv
> V 0 V 0.42846 0.42846 0.08140 0.87000 0.04260
> Al 0 Al 0.42846 0.42846 0.08140 0.13000 0.04260
> P 0 P 0.25000 0.50000 0.00000 1.00000 0.04900
> O 0 O1 0.43570 0.30590 0.05150 1.00000 0.04200
> O 1 O-H2 0.42140 0.42140 0.18660 1.00000 0.05000
> O 1 O-H3 0.55610 -0.55610 0.06640 1.00000 0.04300
> Ca 0 Ca 0.65900 -0.65900 0.16050 0.25000 0.27200
> O 2 Wat1 0.65900 -0.65900 0.16050 0.61000 0.27200
> O 2 Wat2 0.29380 0.29380 0.29380 1.00000 0.19400
> O 2 Wat3 0.33610 0.45200 0.33610 0.56000 0.13000
> O 2 Wat4 0.24510 0.49000 0.24510 1.00000 0.22100
> O 2 Wat5 0.34500 0.54200 -0.54200 0.67000 0.43000
> O 2 Wat6 0.30900 0.69100 -0.69100 0.54000 0.44000
> O 2 Wat7 0.29500 0.59600 -0.59600 0.20000 0.15000

(I assume that O-H2 and O-H3 are hydroxyl ions, thus I indicate that they have 1 hydrogen attached to each of them, but we need to check
the original paper).

Would your software process such markup? I think this is a good, standard, unambiguous way to indicate waters, without interfering with the authors' data too much. (Both '_atom_site_type_symbol' and '_atom_site_attached_hydrogens' are standard IUCr data names; the COD just adds a convention that _atom_site_type_symbol SHOULD contain the IUPAC periodic-table element symbol, or "D" for deuterium, possibly with the atom charge attached.)

Currently, we mark up the structures as we process them for ourselves; if an automated procedure can be devised for spotting all such entries (surely it can be done), we could add such fixes to all COD structures that require them, if that would be helpful for you.

There is already a set of structures (8509 COD entries) marked up in this way, e.g.:

https://www.crystallography.net/cod/9004888.cif
https://www.crystallography.net/cod/9003573.cif
https://www.crystallography.net/cod/9002900.cif
https://www.crystallography.net/cod/9000403.cif
https://www.crystallography.net/cod/9001176.cif
https://www.crystallography.net/cod/9001786.cif
https://www.crystallography.net/cod/9001785.cif
https://www.crystallography.net/cod/9009869.cif
https://www.crystallography.net/cod/9009872.cif
https://www.crystallography.net/cod/9009840.cif

> 2. We check the values of Biso and Baniso, and we also convert
> Baniso-values back to Biso values;
>
> in COD entry 9014636 there are very big Uaniso-values (converted into
> Baniso values > 10), but small Uiso-values?
>
> Questions: does the COD apply a sanity check on the supplied B (or U)
> values, and do you compare anisotropic with isotropic values?

No, we do not check the Uij iso/aniso consistency so far... Thank you for the error report; such inconsistencies are surely errors and need to be checked. I think we can relatively easily add this extra check to our pipeline.

> We can easily give you a list of all COD entries which have (converted)
> B-values > 10, if that helps.

That would be very helpful.
I cannot promise that we will fix them soon if substantial manual work is involved, but we will note the list in our COD bug list and try to deal with it ASAP.

> I am still waiting for the original literature of these two examples to
> exclude any input errors for the B's.

Yes, we should double-check against the originals, this is very wise. I suspect some entries may contain scaling errors (B instead of U, or x10^3 vs x10^4 scale in tables), but we definitely need to check...

> Best regards, and a happy new year 2020 to you!

Many great thanks!

Best,
Saulius

--
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
mobile: (+370-684)-49802, (+370-614)-36366

From thomas.degen at panalytical.com Fri Jan 3 22:00:44 2020
From: thomas.degen at panalytical.com (Thomas Degen)
Date: Fri, 3 Jan 2020 20:00:44 +0000
Subject: [Cod-bugs] COD conversion with HighScore
In-Reply-To:
References:
Message-ID:

Dear Saulius,

Thank you for the swift answer and a Happy New Year 2020 to you too!

We are indeed processing "_atom_site_type_symbol", but this information is missing for many COD entries. It would be great if the chemical element (type symbol) would be unambiguously supplied for each Atom, we would very much appreciate this. Only in that case we can generate a correct diffraction pattern from the atomic coordinates.

Concerning the list of patterns having unusually large isotropic/anisotropic displacement parameters: we will have a look at some in more detail and generate a list of all entries that are suspicious (having B's > 10).
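
The B > 10 screening and the iso/aniso comparison discussed in this thread could be sketched roughly as follows. This is only an illustration, not the HighScore or COD implementation: it uses the standard relation B = 8*pi^2*U, and it assumes orthogonal cell axes so that U_eq is simply the mean of U11, U22 and U33 (for a general cell the reciprocal metric is needed). The function names and the `rel_tol` threshold are hypothetical; only the limit of 10 comes from the messages above.

```python
import math

# Standard relation between the two conventions: B = 8 * pi^2 * U
B_PER_U = 8.0 * math.pi ** 2


def u_to_b(u):
    """Convert a displacement parameter U (A^2) into B (A^2)."""
    return B_PER_U * u


def u_equiv_orthogonal(u11, u22, u33):
    """Equivalent isotropic U from the diagonal U_ij terms.

    ASSUMPTION: orthogonal cell axes; in a general cell the off-diagonal
    terms and the cell metric also enter U_eq.
    """
    return (u11 + u22 + u33) / 3.0


def check_site(u_iso, u_aniso_diag, b_limit=10.0, rel_tol=0.5):
    """Return a list of warnings for one atom site.

    b_limit  -- the B > 10 screening value mentioned in the thread
    rel_tol  -- hypothetical tolerance for the iso/aniso disagreement
    """
    warnings = []
    b_iso = u_to_b(u_iso)
    b_eq = u_to_b(u_equiv_orthogonal(*u_aniso_diag))
    if b_iso > b_limit:
        warnings.append("B_iso = %.2f exceeds %.1f" % (b_iso, b_limit))
    if b_eq > b_limit:
        warnings.append("B_eq = %.2f exceeds %.1f" % (b_eq, b_limit))
    # Flag sites whose isotropic and anisotropic values disagree strongly,
    # e.g. because of a B-instead-of-U or a x10 scaling error.
    if abs(b_eq - b_iso) > rel_tol * max(b_eq, b_iso):
        warnings.append("B_iso = %.2f and B_eq = %.2f are inconsistent"
                        % (b_iso, b_eq))
    return warnings


if __name__ == "__main__":
    # Small U_iso combined with ten times larger U_ii values
    for line in check_site(0.02, (0.2, 0.2, 0.2)):
        print(line)
```

A factor of about 79 between the isotropic and anisotropic values would hint at B stored in place of U, and a factor of 10 at a table scale error, so the ratio itself may be worth logging alongside the warning.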
Note that structures with wrong or too large displacement parameters produce very wrong quantitative results when used in a Rietveld-type (QPA) refinement.

Best regards,
Thomas

From grazulis at ibt.lt Sat Jan 4 10:53:59 2020
From: grazulis at ibt.lt (Saulius Gražulis)
Date: Sat, 4 Jan 2020 10:53:59 +0200
Subject: [Cod-bugs] COD conversion with HighScore
In-Reply-To:
References:
Message-ID: <1378226b-775f-bb93-f29c-2b0290641221@ibt.lt>

Dear Thomas,

On 2020-01-03 22:00, Thomas Degen wrote:
> We are indeed processing "_atom_site_type_symbol",

This is good news! I hope that with this in mind we can adopt a workable policy for the COD curation. Please see below.

> but this information is missing for many COD entries.

The _atom_site_type_symbol is indeed missing in most of the CIFs supplied to the COD.
This is probably even a good thing, since it leaves us an unused data name which we can use for data curation, adding our values without changing the original data.

> It would be great if the chemical element (type symbol) would be
> unambiguously supplied for each Atom, we would very much appreciate
> this.

At the moment, we follow the IUCr definition of the _atom_site_label and _atom_site_type_symbol:

https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_type_symbol.html:

"A code to identify the atom species (singular or plural) occupying this site. /.../ The specification of this code is optional if component 0 of the _atom_site_label is used for this purpose"

Thus, the _atom_site_type_symbol may be missing, and is indeed missing in most of the COD entries. In that case, we are supposed to use the first letters of the _atom_site_label, e.g.: Fe3+17 is Fe; C_a_phe_83_a_0 is C (carbon); O12 is oxygen.

Now, I would be very reluctant to supply the _atom_site_type_symbol automatically, since we can make mistakes; for example, HO12: is it holmium, Ho, or is it hydroxyl, OH-? We had a case where Ho was incorrectly inferred instead of hydroxyl, and I suspect we can have Ho species spelled in all caps as well. Thus, the addition of _atom_site_type_symbol *requires* manual inspection, and we physically cannot do it for every COD entry (we will soon have half a million :).

So I suggest adding _atom_site_type_symbol *only* when the _atom_site_label is ambiguous or can be interpreted incorrectly, as spotted by processing software (so your logs are very important for COD data curation!). If both _atom_site_label and _atom_site_type_symbol are present, then the _atom_site_type_symbol should be used:

https://www.iucr.org/__data/iucr/cifdic_html/1/cif_core.dic/Iatom_site_label.html:

"The _atom_site_type_symbol always takes precedence over an _atom_site_label in the identification of the atom type."

Thus, if we specify a (correct) _atom_site_type_symbol, it will override the _atom_site_label definition; but if there is no _atom_site_type_symbol, then _atom_site_label SHOULD be used for atom type identification.

We would then go through all COD entries that still do not have _atom_site_type_symbol, and for those where _atom_site_label is ambiguous or has a chance of being incorrect, we add _atom_site_type_symbol; the rest we leave untouched (minimal intervention).

> Only in that case we can generate a correct diffraction pattern from
> the atomic coordinates.

Obviously. So, eventually I would suggest the following algorithm for determining the atom type, in pseudocode:

IF _atom_site_type_symbol exists, THEN
    Take the leading *letter* characters of _atom_site_type_symbol
        (e.g.: "O2-"->"O", "Ca2+"->"Ca");
    IF the resulting string matches a known IUPAC atom name, THEN
        Use the resulting string as the atom type name;
    ELSE
        ERROR
    END IF (*inner IF*)
ELSE
    The _atom_site_label MUST exist (else ERROR);
    Take the leading *letter* characters of _atom_site_label
        (e.g.: "O21"->"O", "Ca2+12"->"Ca");
    IF the resulting string matches a known IUPAC atom name, THEN
        Use the resulting string as the atom type name;
    ELSE
        ERROR
    END IF (*inner IF*)
END IF (*outer IF*)

We could also try to correct capitalisation (CA->Ca, ho->Ho) in our algorithm, but this is probably too risky (again, is "ho" a hydroxyl or Ho? You never know what people were thinking...).

One note: the IUCr gives examples of _atom_site_type_symbol such as "Fe3+Ni2+"; this implies to me both Fe and Ni on an occupationally disordered site. The relative occupancies are not explicitly specified in such a case, but the _atom_type_symbol loop MUST contain the combined scattering factors used to refine the species on this site. Hopefully we can handle such cases as well...

It would then be good to check the chemical formula resulting from the atom coordinate entries against the formula provided by the authors.
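
The pseudocode above could be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not COD or HighScore code: the function names are hypothetical, the CIF values "?" and "." are treated as "item absent", and "D" is accepted for deuterium as in the COD convention mentioned earlier in the thread. Matching is case-sensitive on purpose, so that labels such as "HO12" or "Wat1" raise an error and end up on the list for manual inspection.

```python
import re

# IUPAC element symbols, plus "D" for deuterium (COD convention from the thread)
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni
Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe
Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au
Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf
Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og D
""".split())


def _present(value):
    """ASSUMPTION: None, '?' and '.' all mean that the CIF item is absent."""
    return value is not None and value not in ("?", ".")


def _leading_letters(text):
    """Return the leading run of ASCII letters, e.g. 'Ca2+12' -> 'Ca'."""
    match = re.match(r"[A-Za-z]+", text)
    return match.group(0) if match else ""


def atom_type(type_symbol=None, label=None):
    """Determine the element of a site following the thread's pseudocode.

    _atom_site_type_symbol takes precedence over _atom_site_label;
    anything that does not resolve to a known element raises ValueError.
    """
    if _present(type_symbol):
        source_name, source = "_atom_site_type_symbol", type_symbol
    elif _present(label):
        source_name, source = "_atom_site_label", label
    else:
        raise ValueError("neither type symbol nor label is present")
    candidate = _leading_letters(source)
    if candidate in ELEMENTS:  # case-sensitive: no CA -> Ca guessing
        return candidate
    raise ValueError("cannot infer element from %s %r" % (source_name, source))


if __name__ == "__main__":
    for symbol, label in [(None, "O21"), ("O", "Wat1"), (None, "HO12")]:
        try:
            print(label, "->", atom_type(symbol, label))
        except ValueError as err:
            print(label, "-> ERROR:", err)
```

Like the pseudocode, this sketch reduces a mixed-site symbol such as "Fe3+Ni2+" to "Fe" without complaint, so such sites would need separate handling.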
We on the COD side would scan the entries that produce an error in this algorithm, merge that list with your logs where the COD entries produced crazy diffraction patterns (misinterpreting O as Ho should have ρ and F_000 totally off, shouldn't it?), and then manually add the _atom_site_type_symbol to those entries that can be unambiguously corrected from the structure source.

Would such a policy be OK with you? From our side, it looks doable over time if the number of entries to be corrected is not dramatically large (i.e. within the limits of thousands of entries).

Regards,
Saulius

PS. I CC the COD AB list, since this concerns the policies of the COD curation.

--
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
mobile: (+370-684)-49802, (+370-614)-36366