From grazulis at ibt.lt Mon Jan 13 09:23:14 2020
From: grazulis at ibt.lt (Saulius Gražulis)
Date: Mon, 13 Jan 2020 09:23:14 +0200
Subject: [Cod-bugs] COD conversion with HighScore
In-Reply-To:
References: <302131fd-91c5-7fde-1425-25711c20462d@ibt.lt>
Message-ID: <24528fa0-cec1-0350-82c9-394845bc9b39@ibt.lt>

Dear Thomas,

I have finished a short inspection of the COD Uij problems. I attach the file with possible large Uij reasons identified, and a summary of the reason frequencies (files REASONS.lst and SUMMARY.txt, respectively). The REASONS.lst file can be read as CSV with TAB (ASCII 9) characters as column separators.

I have only inspected files with some Bij > 300. There are 40 such files. Lowering threshold Bij > 200 would add extra 10. Lowering to 150 adds ~320 extra files (376 total), and at Bij > 100 we have 1106 total. Thus, Bij < 100 are very common, and Bij > 150 are relatively rare.

On 2020-01-11 18:47, Thomas Degen wrote:
> Concerning these many pattern having so many big displacement
> parameters (which we don't see in other databases) My guess is that
> the Units got confused. So it wasn't U but the data was given as B or
> Beta instead (and simply wrongly flagged as U).

The main reasons for large Bij values, as I see them, are these:

> saulius at koala Uiso/ $ head SUMMARY.txt
>      15 Biso instead of Uiso
>       6 Uij multiplied by 1E4?
>       5 Digits missing from some Uij values?
>       4 Bij instead of Uij
>       3 Bij instead of Uij for just one atom? (???) Or refinement problems?
>       2 Two Uij values stand out. Manual data entry error? Or refinement problems?
>       2 Bij instead of Uij?
>       1 Values for one atom ('C(1)') very large. Problems with refinement?
>       1 Two Uij values stand out. Manual data entry error? Uij multiplied by 1E4?
>       1 Two Uij values stand out. Manual data entry error? (Typed "9" instead of "0"?)

I think we can reasonably fix the first four lines, which gives 15 + 6 + 5 + 4 = 30 corrected COD records. That's doable and most probably will be correct, but very little.

The rest, IMHO, starts getting dubious. In many cases (say for "Uij multiplied by 1E4?") we should probably contact authors to verify that my interpretation of their files is correct.

As for the bulk of the Bij>10 structures, I would say most are organics and have naturally higher Bij values than minerals. I'll discuss this on the COD AB; the people there have much more experience with small molecule crystals than I do.

Taking a random structure from the COD_Conv_Warnings.csv list:

> #@ CODID  AtLabel  Uij data name          Uij         Bij
> 4078519   C21      _atom_site_aniso_u_11  0.17300000  13.65953249
> 4078519   C21      _atom_site_aniso_u_33  0.16900000  13.34370515
> 4078519   C17      _atom_site_aniso_u_22  0.14600000  11.52769794
> 4078519   C19      _atom_site_aniso_u_22  0.13000000  10.26438858

> #@ label  Uiso(CIF)  Ueq(comp)  Beq(comp)   Uiso-Ueq
> C21       0.115      0.114728   9.05854000   0.00027223
> C19       0.096      0.0964213  7.61312000  -0.00042125
> C65B      0.085      0.0852352  6.72990000  -0.00023515
> C17       0.0741     0.0744293  5.87670000  -0.00032931
> C16       0.0652     0.0651477  5.14386000   0.00005231

So, the largest Bij value (U11->B11 for C21) is 13.7, the largest Biso (again, for C21) is 9.1, and the structure, although it has some mild disorder, looks pretty normal to me. Uequiv computed from the Uij are consistent, within error, with the values provided in Uiso.

Do you think Bij > 10 indicates a problem here?

Sincerely yours,
Saulius

--
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
mobile: (+370-684)-49802, (+370-614)-36366

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

-------------- next part --------------
#@CODID  Problem        Author  Possible reason
1544873  some Bij>300   S.G.    Reason for large Uij not clear at all.
1544907  some Bij>300   S.G.    Reason for large Uij not clear at all. Similar to 1544873
2000571  some Bij>300   S.G.    Uij multiplied by 1E4?
2000642  some Bij>300   S.G.    Uij multiplied by 1E4?
2000721  some Bij>300   S.G.    Uij multiplied by 1E4?
2002110  some Bij>300   S.G.    Uij multiplied by 1E4?
2002111  some Bij>300   S.G.    Uij multiplied by 1E4?
2005112  some Bij>300   S.G.    Digits missing from some Uij values? Manual data entry error?
2005689  some Bij>300   S.G.    Extra digits for some Uij values? Manual data entry error?
2006293  some Bij>300   S.G.    Bij instead of Uij
2006294  some Bij>300   S.G.    Bij instead of Uij
2006295  some Bij>300   S.G.    Bij instead of Uij
2009417  some Bij>300   S.G.    Biso instead of Uiso. Uij multiplied by 1E4?
2009425  some Bij>300   S.G.    Uij multiplied by 1E4?
2010272  some Bij>300   S.G.    Bij instead of Uij
2101928  some Bij>300   S.G.    Bij instead of Uij for just one atom? (???) Or refinement problems?
2102017  some Bij>300   S.G.    Two Uij values stand out. Manual data entry error? Or refinement problems?
2201604  some Bij>300   S.G.    One Uij value stands out. Manual data entry error? Digits missing from some Uij values?
4061132  some Bij>300   S.G.    Values for one atom ('C(1)') very large. Problems with refinement?
4114108  some Bij>300   S.G.    One Uij value stands out. Problems with refinement?
4114109  some Bij>300   S.G.    Two Uij values stand out. Manual data entry error? Or refinement problems?
4114580  some Bij>300   S.G.    Digits missing from some Uij values?
4114581  some Bij>300   S.G.    Digits missing from some Uij values?
4115051  some Bij>300   S.G.    Digits missing from some Uij values?
4115055  some Bij>300   S.G.    Digits missing from some Uij values?
4115066  some Bij>300   S.G.    Digits missing from some Uij values?
4116019  some Bij>300   S.G.    Bij instead of Uij for just one atom? (???) Or refinement problems?
4307487  some Bij>300   S.G.    Bij instead of Uij for just one atom? (???) Or refinement problems?
4322175  some Bij>300   S.G.    Digits missing from some Uij values? Problems with refinement?
4322875  some Bij>300   S.G.    The first (U11) value on *some*, but not *all*, hydrogens seems to be converted to B instead of U (???)
9004552  some Bij>300   S.G.    Some atoms seem to have Uij, some probably have Biso specified as U11. Manual data entry error?
9007611  some Bij>300   S.G.    Heavy atoms seem to have Uij, hydrogens probably have Biso specified as U11. Manual data entry error?
9009485  some Bij<-200  S.G.    Large negative U23 for some atoms. Problems with refinement?
9013813  some Bij>300   S.G.    Two Uij values stand out. Manual data entry error? (Typed "9" instead of "0"?)
9013821  some Bij>300   S.G.    One Uij value stands out. Manual data entry error?
9014030  some Bij>300   S.G.    Two Uij values stand out. Manual data entry error?
9014636  some Bij>300   S.G.    One Uij value stands out. Manual data entry error? Or refinement problems?
9014842  some Bij>300   S.G.    Three Uij values stand out. Manual data entry error? Uij multiplied by 1E4?
9014997  some Bij>300   S.G.    Bij instead of Uij?
9016254  some Bij>300   S.G.    Two Uij values stand out. Manual data entry error? Uij multiplied by 1E4?
9016691  some Bij>300   S.G.    Bij instead of Uij?
2001154  Uiso-Ueq>1     S.G.    Biso instead of Uiso
2001156  Uiso-Ueq>1     S.G.    Biso instead of Uiso
2003303  Uiso-Ueq>1     S.G.    Biso instead of Uiso
2003596  Uiso-Ueq>1     S.G.    Biso instead of Uiso
2004328  Uiso-Ueq>1     S.G.    Biso instead of Uiso; bad orthogonalisation?
2004354  Uiso-Ueq>1     S.G.    Biso instead of Uiso
2004427  Uiso-Ueq>1     S.G.    Biso instead of Uiso
2004531  Uiso-Ueq>1     S.G.    Biso instead of Uiso
2004782  Uiso-Ueq>1     S.G.    Biso instead of Uiso
2004836  Uiso-Ueq>1     S.G.    Biso instead of Uiso
2005572  Uiso-Ueq>1     S.G.    Biso instead of Uiso
2006511  Uiso-Ueq>1     S.G.    Biso instead of Uiso; problems with orthogonalisation?
2011176  Uiso-Ueq>1     S.G.    Biso instead of Uiso
4320747  Uiso-Ueq>1     S.G.    Biso instead of Uiso
4321814  Uiso-Ueq>1     S.G.    Biso instead of Uiso
4323429  Uiso-Ueq>1     S.G.    Biso instead of Uiso
8101564  Uiso-Ueq>1     S.G.    Biso instead of Uiso
-------------- next part --------------
     15 Biso instead of Uiso
      6 Uij multiplied by 1E4?
      5 Digits missing from some Uij values?
      4 Bij instead of Uij
      3 Bij instead of Uij for just one atom? (???) Or refinement problems?
      2 Two Uij values stand out. Manual data entry error? Or refinement problems?
      2 Bij instead of Uij?
      1 Values for one atom ('C(1)') very large. Problems with refinement?
      1 Two Uij values stand out. Manual data entry error? Uij multiplied by 1E4?
      1 Two Uij values stand out. Manual data entry error? (Typed "9" instead of "0"?)
      1 Two Uij values stand out. Manual data entry error?
      1 Three Uij values stand out. Manual data entry error? Uij multiplied by 1E4?
      1 The first (U11) value on *some*, but not *all*, hydrogens seems to be converted to B instead of U (???)
      1 Some atoms seem to have Uij, some probably have Biso specified as U11. Manual data entry error?
      1 Reason for large Uij not clear at all. Similar to 1544873
      1 Reason for large Uij not clear at all.
      1 Possible reason
      1 One Uij value stands out. Problems with refinement?
      1 One Uij value stands out. Manual data entry error? Or refinement problems?
      1 One Uij value stands out. Manual data entry error? Digits missing from some Uij values?
      1 One Uij value stands out. Manual data entry error?
      1 Large negative U23 for some atoms. Problems with refinement?
      1 Heavy atoms seem to have Uij, hydrogens probably have Biso specified as U11. Manual data entry error?
      1 Extra digits for some Uij values? Manual data entry error?
      1 Digits missing from some Uij values? Problems with refinement?
      1 Digits missing from some Uij values? Manual data entry error?
      1 Biso instead of Uiso; problems with orthogonalisation?
      1 Biso instead of Uiso; bad orthogonalisation?
      1 Biso instead of Uiso. Uij multiplied by 1E4?
-------------- next part --------------
A non-text attachment was scrubbed...
Name: grazulis.vcf
Type: text/x-vcard
Size: 4 bytes
Desc: not available
URL:

From thomas.degen at panalytical.com Mon Jan 13 09:27:25 2020
From: thomas.degen at panalytical.com (Thomas Degen)
Date: Mon, 13 Jan 2020 07:27:25 +0000
Subject: [Cod-bugs] COD conversion with HighScore
In-Reply-To: <60427be7-9b95-2725-af7c-90341b201eeb@ibt.lt>
References: <302131fd-91c5-7fde-1425-25711c20462d@ibt.lt> <60427be7-9b95-2725-af7c-90341b201eeb@ibt.lt>
Message-ID:

Dear Saulius,

> Do you take Bij values after transforming them to an orthogonal frame, or in the original CIF Uij frame of reference?

We convert displacement parameters (and their standard deviations) from any source from U or Beta to B, where U to B is a factor of 8*Sqr(Pi) (i.e. 8π²) and Beta to B is computed like this:

  Atom.B11.Value := 4 * (Beta11.Value / Sqr(ReciprocalAxis.a));
  Atom.B22.Value := 4 * (Beta22.Value / Sqr(ReciprocalAxis.b));
  Atom.B33.Value := 4 * (Beta33.Value / Sqr(ReciprocalAxis.c));
  // Off-diagonal terms divide by the product of the two reciprocal axes.
  Atom.B12.Value := 4 * (Beta12.Value / (ReciprocalAxis.a * ReciprocalAxis.b));
  Atom.B13.Value := 4 * (Beta13.Value / (ReciprocalAxis.a * ReciprocalAxis.c));
  Atom.B23.Value := 4 * (Beta23.Value / (ReciprocalAxis.b * ReciprocalAxis.c));

And then keep them like that in our document.

Best regards,
Thomas

-----Original Message-----
From: Saulius Gražulis
Sent: Sunday, January 12, 2020 6:01 PM
To: Thomas Degen; Thomas Dortmann; cod-bugs at ibt.lt
Cc: Thomas Dortmann
Subject: Re: [Cod-bugs] COD conversion with HighScore

Dear Thomas,

thank you for the clarifications!

On 2020-01-11 18:47, Thomas Degen wrote:
> No, it is what it says, it is about how often any of the anisotropic values (B11,B22,B33,B12,B23,B13) hit the "B >= 10" limit.

I see. I misread the table, assuming 'Banis' is in fact Beq computed from Bij. Now I can mostly get the same counts as in your table; occasionally I still get somewhat fewer Bij >= 10 than in your table (e.g. for COD 7123444 I get 51 large Bij values instead of 54). But the results are now very similar, so I can proceed with the survey of the issues.

Do you take Bij values after transforming them to an orthogonal frame, or in the original CIF Uij frame of reference?

> We are using B instead of U in the UI because these values are closer to 1, which makes them more convenient to look at.

OK, clear. * 8 * pi ** 2 ?

>> Could you please send me your code example
> This is approximately the code to convert Banis to Biso:
>
> case FSpaceGroup.SimpleCrystalSystem of
>   scTriclinic:
>     begin
>       Atom.Biso.Value := (1 / 3) *
>         (Atom.b11.Value * Sqr(Cell.a * ReciprocalCell.a)
>          + Atom.b22.Value * Sqr(Cell.b * ReciprocalCell.b)
>          + Atom.b33.Value * Sqr(Cell.c * ReciprocalCell.c)
>          + 2 * Atom.b12.Value * Cell.a * Cell.b * ReciprocalCell.a * ReciprocalCell.b * CosDeg(Cell.gamma)
>          + 2 * Atom.b13.Value * Cell.a * Cell.c * ReciprocalCell.a * ReciprocalCell.c * CosDeg(Cell.beta)
>          + 2 * Atom.b23.Value * Cell.b * Cell.c * ReciprocalCell.b * ReciprocalCell.c * CosDeg(Cell.alpha));
>     end;

Perfect. This is the same formula as Fischer1988 describes, and which I use as well. The values should be reproducible (even though I do not use special cases for higher-symmetry groups).

> Concerning these many pattern having so many big displacement
> parameters (which we don't see in other databases) My guess is that the Units got confused. So it wasn't U but the data was given as B or Beta instead (and simply wrongly flagged as U).

In some cases, the units are indeed mixed up, but I now find a bunch of other reasons. I'll let you know in a follow-up mail.

Sincerely yours,
Saulius

--
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
mobile: (+370-684)-49802, (+370-614)-36366

This email and any files transmitted with it are confidential and may be legally privileged. Such message is intended solely for the use of the individual or entity to whom they are addressed. Please notify the originator of the message if you are not the intended recipient and destroy all copies of the message. Please note that any use, dissemination, or reproduction is strictly prohibited and may be unlawful.

From thomas.dortmann at panalytical.com Mon Jan 13 12:43:06 2020
From: thomas.dortmann at panalytical.com (Thomas Dortmann)
Date: Mon, 13 Jan 2020 10:43:06 +0000
Subject: [Cod-bugs] COD conversion with HighScore
In-Reply-To: <24528fa0-cec1-0350-82c9-394845bc9b39@ibt.lt>
References: <302131fd-91c5-7fde-1425-25711c20462d@ibt.lt> <24528fa0-cec1-0350-82c9-394845bc9b39@ibt.lt>
Message-ID:

Hi Saulius,

a) concerning atom-labels:

we fixed the "Wat" atom-label in our conversion and now the number of patterns containing the (wrong!) element Astatine is down from 1180 to 8!
These eight remaining patterns are all minerals, where water is coded as "OWat(n)" instead of "Wat";
these are the corresponding CIFs:
9006364 - 9006368
9014246
9014312
9016377

b) concerning large Baniso and Biso values:

Both Thomas Degen and I have much more experience with inorganic crystal structures than with organic crystal structures. So it is quite possible that our warning threshold of B > 10 is too low. On the other hand, I thought many organic structures are solved from low-temperature data (< 100 K), with thermal vibrations being lower than at room temperature? Anyway, we are trying to get better statistics about this point from other databases, but this could take a while. I will come back on this issue when we have news about it.
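For orientation on the B > 10 threshold discussed here, the following is a minimal sketch of the conversions described earlier in this thread: B = 8π²U, and B_ij = 4·β_ij / (a*_i · a*_j) for Beta values. The function names u_to_b, b_to_u and beta_to_b are hypothetical and are not taken from HighScore or from the COD tools; the sample U values are the ones quoted in this thread for COD 4078519.

```python
import math

# Conversion factor between U and B (both in square angstroms): B = 8 * pi^2 * U.
EIGHT_PI_SQ = 8.0 * math.pi ** 2


def u_to_b(u):
    """Convert a displacement parameter U to B."""
    return EIGHT_PI_SQ * u


def b_to_u(b):
    """Convert a displacement parameter B back to U."""
    return b / EIGHT_PI_SQ


def beta_to_b(beta_ij, recip_i, recip_j):
    """Convert a dimensionless beta_ij to B_ij.

    recip_i and recip_j are the lengths of the two reciprocal axes
    (a*, b* or c*) that belong to the indices i and j.
    """
    return 4.0 * beta_ij / (recip_i * recip_j)


if __name__ == "__main__":
    # U values quoted in this thread for COD 4078519.
    samples = [("C21 U11", 0.173), ("C21 U33", 0.169),
               ("C17 U22", 0.146), ("C19 U22", 0.130)]
    for label, u in samples:
        print(f"{label}: U = {u:.3f} -> B = {u_to_b(u):.4f}")
    # Express the warning threshold B = 10 as a U value.
    print(f"B = 10 corresponds to U = {b_to_u(10.0):.4f}")
```

With these definitions, U11 = 0.173 for C21 gives B11 of about 13.66, as in the table quoted earlier, and a threshold of B = 10 corresponds to U of roughly 0.127.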
Best regards,
Thomas Dortmann

From grazulis at ibt.lt Mon Jan 13 21:21:40 2020
From: grazulis at ibt.lt (Saulius Gražulis)
Date: Mon, 13 Jan 2020 21:21:40 +0200
Subject: [Cod-bugs] COD conversion with HighScore
In-Reply-To:
References: <302131fd-91c5-7fde-1425-25711c20462d@ibt.lt> <24528fa0-cec1-0350-82c9-394845bc9b39@ibt.lt>
Message-ID: <98d13001-c538-8374-cf22-9d094d40cc51@ibt.lt>

Dear Thomas,

On 2020-01-13 12:43, Thomas Dortmann wrote:
> a) concerning atom-labels:
>
> we fixed the "Wat" atom-label in our conversion and now the number of pattern containing the (wrong!) element Astatine is down from 1180 to 8!
> These eight remaining patterns are all minerals, where water is coded as "OWat(n)" instead of "Wat";
> these are the corresponding CIF's:
> 9006364 - 9006368
> 9014246
> 9014312
> 9016377

This is good news. I'm very happy that your conversion software runs nearly 100%!

On my side, I went through the "uncanonical" atom names in the COD (http://saulius-grazulis.lt/~saulius/.d981490889b10e82e8f6943bbfd569aaebf1c8c3/). In the file "estimated-atom-types.lst", the first column is the estimated atom type, the second is the atom name in the corresponding CIF, and the third is the number of occurrences of this atom name. The "DOUBLE_CHECK.lst" contains a manually compiled list of atom types that are most probably wrong after automatic detection and will need to be inspected by a human.

The policy I would adopt is the following:

a/ If a CIF already contains _atom_site_type_symbol, we do nothing. Reason: the _atom_site_type_symbol is either added manually by COD curators (in this case we do not want to undo our manual work), or it is provided by CIF authors.
Among the atom type symbols, the most common irregularity is symbols written in all lowercase or all uppercase. These can be dealt with by regularising the case and looking the result up in a table; e.g. we do: ucfirst(lc($atom_type_symbol)), where lc($string) returns an all-lowercase version of the argument string, and ucfirst($string) returns the string with the first letter uppercased, yielding "Ca" from both "ca" and "CA", which is mostly correct. From 14742 atoms in the COD that have _atom_site_type_symbol values, only 28 could not be interpreted in this way, a negligible amount, and non-correctable even manually. Since, as I understand, your software already incorporates this heuristic, atoms with _atom_site_type_symbol will not be a problem, will they?

b/ If the atom does *not* have the _atom_site_type_symbol, we will guess its type from the atom label. If the leading non-digit characters of the atom label yield a valid periodic system element name, we do nothing. If the leading non-digit characters of the atom site label do *not* yield a recognisable atom name, we apply heuristics as noted in the estimated-atom-types.lst.log in the Web page cited above; in Perl:

    $n1 = ucfirst(substr($atom_site_label,0,1));
    $n2 = ucfirst(lc(substr($atom_site_label,0,2)));
    if( $atom_site_label =~ /^Wat[A-Za-z0-9\(\)]*$/ ) {
        $atom_site_type_symbol = "O"
    } elsif( exists $COD::AtomProperties::atoms{$n2} ) {
        $atom_site_type_symbol = $n2
    } elsif( exists $COD::AtomProperties::atoms{$n1} ) {
        $atom_site_type_symbol = $n1
    } else {
        $atom_site_type_symbol = "?";
        print STDERR "$0: WARNING, atom type for atom \"$F[1]\" is not recognised\n"
    }

We then compute the summary formula with the new atom types, and compare it with the formula provided by the authors. If the summary formulae match, we add the _atom_site_type_symbol to the CIF. If not, we report an error. After this, we double-check the atom types mentioned in DOUBLE_CHECK.lst.
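The label heuristic above can be sketched in Python as follows. This is only an illustration of the Perl fragment, not the COD implementation: ELEMENTS is a hypothetical stand-in for the COD::AtomProperties::atoms table (with "D" added on the assumption that the COD table accepts deuterium, as the "D" entries in estimated-atom-types.lst suggest), and the function names are made up for this example.

```python
import re

# Hypothetical stand-in for the COD::AtomProperties::atoms lookup table.
ELEMENTS = set("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni
Cu Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe
Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au
Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf
Db Sg Bh Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())
ELEMENTS.add("D")  # assumption: deuterium is accepted as an atom type

# Water labels such as "Wat", "WatA" or "Wat(1)" are mapped to oxygen.
WATER_LABEL = re.compile(r"Wat[A-Za-z0-9()]*")


def normalise_type_symbol(symbol):
    """Policy a/: regularise case, so that 'ca' and 'CA' both become 'Ca'."""
    return symbol.capitalize()


def guess_type_from_label(label):
    """Policy b/: guess the atom type from an atom site label.

    Returns '?' when neither the first two characters nor the first
    character of the label form a known element symbol.
    """
    n1 = label[:1].upper()
    n2 = label[:2].capitalize()
    if WATER_LABEL.fullmatch(label):
        return "O"
    if n2 in ELEMENTS:
        return n2
    if n1 in ELEMENTS:
        return n1
    return "?"


if __name__ == "__main__":
    for label in ["Wat", "MgM", "CL", "OW", "OWat", "MO", "M"]:
        print(label, "->", guess_type_from_label(label))
```

On labels from estimated-atom-types.lst this sketch gives O for "Wat" and "OW", Mg for "MgM", and "?" for "M". Ambiguous labels such as "MO" or "OS" still come out as Mo and Os, which is why a manually checked DOUBLE_CHECK.lst remains necessary.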
The new modified CIFs will have recognisable (standard) element names in _atom_site_type_symbol, and will have a correct chemical formula computable from atom records (correct means the same as provided by the author). The results will be similar to those in estimated-atom-types.lst.

The new CIFs may only break the heuristics in your program if:

1/ we guess the atom types wrongly,
2/ the authors provided an incorrect summary chemical formula,
3/ the two incorrect formulas match by pure accident,
4/ your heuristics get the atom types correctly.

or

1'/ we make two mistakes that compensate each other exactly (e.g. Ca->C on one site, and C->Ca on another site) and still get the correct formula with incorrect atom site assignments.

I regard the coincidence of these events as highly unlikely. Also, when detected, the _atom_site_type_symbol values can be curated manually and will *not* be overridden again by automatic software.

If you find such a COD curation policy acceptable, we will proceed with its implementation at some time in the future, and add it to our automatic pipelines (but without the manual check stage for every incoming file...).

I CC this e-mail to the COD AB for discussion and eventual policy approval.

Regards,
Saulius

--
Dr. Saulius Gražulis
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lietuva (Lithuania)
fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
mobile: (+370-684)-49802, (+370-614)-36366
-------------- next part --------------
Na  NaAm    138  How to represent Na and Am on the same site? Is it really two metals on the same site?
Mo  MO      118  Is it Mo or a metal site "M" with index "O"?
Na  NaAM     38  Is it Na and Am on the same site, or Na on the site indexed "AM"?
La  LaNa     22  How to represent La and Na on the same site?
Ca  CaMO     21  Is it Ca and Mo on the same site, or Ca on the site indexed "MO"?
K   KNa      18  How to represent K and Na on the same site? Is it really two metals on the same site?
Sr  SrNa     16  How to represent Sr and Na on the same site?
Nd  NdNa     14  How to represent Nd and Na on the same site? Is it really two metals on the same site?
Si  SiTl     11  Are these really Si and Tl on the same site?
Ca  CaAm      9  Are these really Si and Tl on the same site?
Al  AlTc      9  Are these really Si and Tl on the same site?
Al  AlTA      9  Is it Al and Ta, or Al on the site indexed "TA"?
Si  SiTc      8  Are these really Si and Tc on the same site?
Na  NaCa      8  How to represent Na and Ca on the same site? Is it really two metals on the same site?
Si  SiTb      7  Are these really Si and Tb on the same site?
Si  SiTa      7  Are these really Si and Ta on the same site?
Nb  NbTi      7  Are these really Si and Ti on the same site?
Nb  NbMO      7  Is it Nb and Mo, or Nb on the site indexed "MO"?
Os  OSi       6  Os or Si?
O   ON        6  O or mixed site O/N?
F   FXOM      6  O or mixed site O/F?
K   KBa       5  K or mixed K/Ba?
Nd  NdCa      4  How to present Na and Cd on the same site?
Cu  CuMn      3  How to present Cu and Mn on the same site?
W   WO        2  W or water?
W   WB        2  W or water?
Ca  CaSr      2  How to present Ca and Sr on the same site?
Ca  CaNac     2  Ca or Ca/Na?
Tl  TlI       1  Tl or Tl/I?
Ti  TiSi      1  How to present Ti and Si on the same site?
Ti  TiP       1  How to present Ti and P on the same site?
Ti  TiFe      1  How to present Ti and Fe on the same site?
Sn  SnTi      1  How to present Sn and Ti on the same site?
Si  SiNb      1  How to present Si and Nb on the same site?
S   STe       1  How to present S and Te on the same site?
Pr  PrNa      1  Pr or Pr/Na?
Pb  PbBa      1  How to present Pb and Ba on the same site?
Pa  PAs       1  How to present P and As on the same site?
Os  OSl       1  Os or O on site "Sl"?
O   OCl       1  O or O/Cl?
Na  NaCl      1  Na and Cl on the same site?!
Na  NaBa      1  How to present Na and Ba on the same site?
Mg  MgCu      1  How to present Mg and Cu on the same site?
Mg  MgCA      1  Mg or Mg/Ca?
Mg  MgBA      1  Mg or Mg/Ba?
Li  LiAm      1  How to present Li and Am on the same site?
K   KCl       1  K and Cl on the same site?!
Ho  HOh       1  Ho or H on the OH- site?
Ge  GeGa      1  How to present Ge and Ga on the same site?
Fe  FeCu      1  How to present Fe and Cu on the same site?
Cu  CuMg      1  How to present Cu and Mg on the same site?
Ca  CaMn      1  How to present Ca and Mn on the same site?
Ca  CaFe      1  How to present Ca and Fe on the same site?
Ca  CaAl      1  How to present Ca and Al on the same site?
Cm  CMe       1  Cm or C on the "Me" site?
Ba  BaCa      1  How to present Ba and Ca on the same site?
Hs  HS        1  Hs, H or S?
Ge  GeGa      1  How to present Ge and Ga on the same site?
Fe  FeCu      1  How to present Fe and Cu on the same site?
Cu  CuMg      1  How to present Cu and Mg on the same site?
Ce  CeCad     1  Ce or Ce/Ca?
Ce  CeCac     1  Ce or Ce/Ca?
Ce  CeCab     1  Ce or Ce/Ca?
Ce  CeCaa     1  Ce or Ce/Ca?
Ce  CeC       1  Ce or Ce/C?
Ca  CaFe      1  How to present Ca and Fe on the same site?
Ca  CaAl      1  Ca or Ca/Al? How to present Ca and Al on the same site?
Ba  BaCa      1  How to present Ba and Ca on the same site?
As  AsSb      1  How to present As and Sb on the same site?
Al  AlMg      1  How to present Al and Mg on the same site?
Al  AlFe      1  How to present Al and Fe on the same site?
-------------- next part --------------
H h 5018 C c 4653 O Wat 4264 Mg MgM 2775 Fe FeM 2341 Si SiT 2169 Al AlT 2103 Al AlM 1602 N n 1515 Mn MnM 1095 Ti TiM 828 Ca CaM 825 Mg MgT 758 Cl CL 686 Na NaA 649 O o 555 Fe FeT 550 Na NaM 539 O OW 520 K KA 455 Ca CaX 435 O Ow 426 O OA 423 O OC 362 Na NaX 361 Ca CaA 334 Al AlY 332 Cr CrM 325 O OB 313 Mn MN 286 Si SiA 276 Si SiB 273 Si SI 263 Zn ZnT 252 Al AlZ 242 Fe FE 224 O OD 216 Li LiM 208 Zn ZnM 182 Fe FeY 180 Mg MgY 175 F f 174 O Oa 173 Li LiY 163 Ti TiY 161 S SX 155 Ni NiM 155 K KX 154 Se SeX 150 Mn MnY 149 Fe FeO 148 Ni NI 139 Na NaAm 138 Cu CU 138 Si SiZ 137 Fe FeA 137 Fe FeB 134 Mn MnT 125 Pb PbM 124 Mg MgO 124 Mo MO 118 Cu CuM 117 Al AlB 117 Bi BiMe 111 Ti TiT 109 Mg MgA 109 Cl cl 106 Sr SrA 106 Bi BiM 106 Mg MG 104 Si SiM 103 Pb PbMe 103 K KAm 102 Mn MnA 100 Nb NbM 97 Cl ClX 96 Mg MgB 94 Ca CA 93 Co CoM 91 P p 90 Mg MgZ 86 Co CO 86 Ce CeM 81 B BT 81 Si SiD 80 Mg MgX 80 Si SiC 78 Zr ZrM 74 V VM 74 O Oc 73 Si SiF 69 Si SiE 69 Sb SbM 69 K KC 69 Ti TiB 68
Al AlA 67 Os OS 66 Pb PB 62 K KM 62 Ce CeA 62 La LA 60 Ga GaM 60 Ba BaA 60 Na NaB 59 O OX 55 Ca CaC 55 Na NA 54 Br BR 54 O Ob 53 Ti TiA 52 Fe FeX 52 Mn MnX 51 Si SiTB 50 Si SiTA 50 Pb PbX 50 Si SiG 48 Na NaC 48 S s 47 Si SiH 47 Li LI 47 Y YM 46 La LaM 46 H HW 46 Zn ZnY 45 Si SiL 45 Si SiK 45 Si SiJ 45 O Odo 45 Nd NdM 45 La LaA 45 Fe FeZ 45 Ca CaB 45 Al AL 44 O WatA 43 O Oco 43 O Obo 43 O WatX 42 Ru RU 42 Ge GE 42 ? M 41 K KAM 41 H HI 41 D DZW 41 Sr SrM 39 O Odm 39 O Ocm 39 O Obm 39 Ir IR 39 Na NaAM 38 H HC 38 Bi BI 38 O OIIB 37 Ce CE 37 O WatB 36 Sr SR 36 Ga GaT 36 Pb PbA 35 N NH 35 Nb NB 35 Ge GeT 35 I IS 34 Ta TaB 33 Sc SC 33 Nb NbB 33 Pt PT 32 Zn ZN 31 Mn MnB 31 Cl ClS 31 Be BeT 31 Mg MgC 30 F FX 30 As as 29 O WatO 29 Se SE 29 Rb RbA 29 Na NaMO 29 Cu CuZn 29 Cr CrY 29 C CH 29 Sm SmM 28 O OI 28 Pr PrM 27 Zr ZR 26 Ti TiZ 26 Ti TI 26 O Od 26 Nd NdA 26 As AsT 26 V VY 25 O ODm 25 O OCm 25 O OBm 25 Mn MnMO 25 Ce CeNa 25 O WatZ 24 Sb SbA 24 O OT 24 Nb NbA 24 Gd GdM 24 Cr CrB 24 As AsM 24 Y YA 23 Si Sia 23 Sb SB 23 Cr CR 23 Al AlX 23 Au AU 23 Ir ir 22 Zn ZnCu 22 O WatC 22 Ti TiD 22 Sc ScM 22 S SA 22 O Op 22 La LaNa 22 Co CoT 22 Ag AgM 22 Pt pt 21 D d 21 Tl TL 21 Sr SrX 21 Sb SbB 21 Nb NbD 21 Cu CuT 21 Ca CaMO 21 Ba BA 21 O OII 20 O OE 20 Na NaK 20 Ca CaNa 20 Br br 19 Si Sil 19 Re RE 19 Pr PrA 19 Na NaN 19 Fe FeMe 19 Dy DyM 19 Cu CuB 19 Ta TaA 18 O OWat 18 O OV 18 Ni NiO 18 Na NaCh 18 K KNa 18 K KB 18 Ga GA 18 Ca Cal 18 Ba BaAP 18 As AsX 18 Ti TiMH 17 O OIII 17 O ODo 17 O OCo 17 O OBo 17 Mg MgOct 17 Sr SrNa 16 Rh RH 16 Cs CS 16 Al Ala 16 Al AlTB 16 Th ThA 15 Si Sib 15 Ni NiT 15 Li LiT 15 Fe FeC 15 Cu CuA 15 Cs CsA 15 Ca CaW 15 Si si 14 Zr ZrMO 14 Tm TM 14 Sr SrAP 14 Si SiY 14 O Oab 14 O OZW 14 Ni NiMe 14 Nd NdNa 14 Hg HG 14 Cd CdB 14 Mo mo 13 Fe fe 13 Zr ZrY 13 O WatW 13 Ta TA 13 Sn SnM 13 Sm SmA 13 Sn SN 13 O OIB 13 Mg MgMO 13 H Hw 13 Cl ClA 13 Cd CdA 13 Co co 12 Ti Tioct 12 Ti TiMO 12 Ti TiC 12 Ta TaM 12 S ST 12 O Oin 12 O OXOM 12 O OL 12 O OCr 12 Nb NbC 12 H HII 12 Fe 
FeMO 12 Ca CaI 12 Cd CD 12 Ba BaM 12 Al Aloct 12 I i 11 V VT 11 Si SiTl 11 Si SiTd 11 Mn MnZ 11 Lu LU 11 K KAP 11 Gd GdA 11 F FXOA 11 D Dw 11 Cu CuMe 11 Co CoMe 11 Ca CaK 11 Ca CaII 11 Ag AgMe 11 B b 10 Pd PD 10 O OF 10 Nb NbSi 10 Nb NbMH 10 N NHA 10 Mn MnC 10 Ca Cac 10 Ca CaIIB 10 Ba BaB 10 Al All 10 Al AlTl 10 Al AlOct 10 Al AlMH 10 W w 9 Th ThM 9 Sn SnB 9 Sc ScY 9 Pa PA 9 O OIIA 9 Nb NbY 9 Mn MnMe 9 Li LiA 9 In InM 9 F FO 9 Er ErA 9 Eu EU 9 Ca CaY 9 Ca CaAm 9 Ca CaAP 9 Al AlTc 9 Al AlTA 9 Ag AG 9 Ni ni 8 Al al 8 Zn ZnA 8 ? X 8 Si SiTc 8 O OCh 8 Na NaI 8 Na NaCa 8 Mn MnD 8 Ge GeA 8 Dy DyA 8 D Dd 8 Ca Cab 8 Ca CaCS 8 Bi BiME 8 Na na 7 In in 7 Cr cr 7 O WatII 7 O WatD 7 U UA 7 Si SiTet 7 Si SiTb 7 Si SiTa 7 Si SiO 7 Rb RB 7 O Ot 7 O OIA 7 Ni NiA 7 Nb NbTi 7 Nb NbMO 7 Na Nal 7 Nd ND 7 K KK 7 Hf HfM 7 Gd GD 7 Fe Fea 7 Fe FeMT 7 F FW 7 D Dwa 7 Cr CrZ 7 Ce CeB 7 Cd CdM 7 Ca Cazoo 7 Ca Caooo 7 Ba BaC 7 Be BE 7 Al AlTd 7 Zn zn 6 Li li 6 Au au 6 Ag ag 6 Zr ZrA 6 Yb YbA 6 Y YMH 6 Y YAP 6 O Watl 6 O WatI 6 O WatCh 6 Ta TaC 6 ? 
T 6 Sr SrB 6 Si SiVI 6 Si SiOc 6 Si SiIV 6 Ra RA 6 Pr PrMH 6 Pr PrAP 6 Pr PR 6 O Oint 6 O Odz 6 O Ocz 6 Os OSi 6 O ON 6 Nd NdMH 6 Nd NdB 6 Nd NdAP 6 Mg Mgl 6 Mg MgMe 6 Mg MgD 6 La LaMH 6 La LaB 6 La LaAP 6 K KI 6 In IN 6 H Hy 6 H Hl 6 Ho HO 6 Ge GeZ 6 Ge GeB 6 Fe Fel 6 Fe FeD 6 Fe FeAo 6 F FXOM 6 Cr CrX 6 Ce CeMH 6 Ce CeAP 6 Ca Cazio 6 Ca Caoio 6 Ca CaZ 6 Ca CaMH 6 Bi BiC 6 Al Alb 6 Al AlTb 6 Al AlTa 6 Ru ru 5 Cu cu 5 Zr ZrTa 5 Zr ZrB 5 O WatT 5 O WatM 5 O WatJ 5 O WatG 5 O WatF 5 U UM 5 Sn SnA 5 Sm SmMH 5 Sb SbMe 5 O Oo 5 O Odoz 5 O Odoo 5 O Odmz 5 O Odmo 5 O Ocoz 5 O Ocoo 5 O Ocmz 5 O Ocmo 5 O Oboz 5 O Oboo 5 O Obmz 5 O Obmo 5 O OIV 5 Na NaNa 5 Na NaII 5 Na NaBP 5 Mg MgTM 5 Mg MgMT 5 K KD 5 K KBa 5 Fe Feoct 5 Fe Feb 5 Er ER 5 Cd CdT 5 Ca CaMe 5 Ca CaIIA 5 Ca CaIB 5 Ca CaD 5 Ca CaBP 5 Ba BaK 5 Y y 4 Zr ZrZ 4 Zr ZrTi 4 O WatXPA 4 O WatH 4 W WM 4 V VX 4 Tm TmM 4 Ti TiMI 4 Sr SrN 4 Sr SrCS 4 Sr SrC 4 Sm SmAP 4 Si Sic 4 Si SiX 4 Si SiII 4 Sc ScA 4 Sb Sbl 4 Sb SbX 4 S SASH 4 Pd Pda 4 O Obz 4 O OU 4 O OH 4 O OCH 4 Nd NdCa 4 Nb NbMI 4 Nb NbII 4 Nb NbI 4 Na NaCS 4 Mn MnMn 4 Mg MgMD 4 K KCS 4 Hg HgM 4 Gd GdMH 4 Fe FeOct 4 Fe FeMD 4 Fe FeI 4 Fe FeFe 4 Fe FeAm 4 F FXPM 4 D Dda 4 D DWH 4 Cu Cul 4 Cs CsC 4 C CN 4 Be BeZ 4 Ba BaCS 4 B BZ 4 Zr zr 3 Pd pd 3 Mn mn 3 Zr ZrX 3 Zr ZrD 3 Zn ZnMe 3 Yb YbM 3 Y YX 3 O WatXPM 3 O WatIV 3 O WatE 3 V VdZ 3 V VMn 3 Ti TidZ 3 Ti TiX 3 Te TeX 3 Sn SnT 3 Sn SnC 3 Si Sif 3 Si Sie 3 Si Sid 3 Si SiI 3 Si SiAl 3 Pr PrB 3 Pd Pdb 3 Pb Pbl 3 Pb Pba 3 P PX 3 O Ol 3 O Oe 3 O OVI 3 O ODz 3 O ODw 3 O OCz 3 O OBz 3 Ni NieZ 3 Ni NicZ 3 Ni NicY 3 Ni NibZ 3 Na Naint 3 Na NaY 3 Na NaChO 3 Na NaAP 3 Ni NII 3 Mn Mnc 3 Mn MnFe 3 Mg MgfZ 3 Mg MgeZ 3 Mg MgdZ 3 Mg MgcZ 3 Mg MgcY 3 Mg MgbZ 3 Mg MgbY 3 Mg MgaZ 3 Mg MgaY 3 Mg MgMn 3 Mg MgG 3 Li LiPU 3 K KN 3 K KMi 3 K KE 3 H HN 3 Fe FedZ 3 Fe FebY 3 Fe FeaY 3 Fe FeII 3 Er ErM 3 D Dha 3 D Dh 3 Cu CuMn 3 Cr CrfZ 3 Cr CrcY 3 Cr CrbY 3 Cl Cll 3 Cl ClB 3 Ca Caa 3 Ca CaP 3 Ca CaF 3 Br BrX 3 Be Bel 3 Be BeSi 3 Ba BaX 3 Ba 
BaNa 3 Ba BaBA 3 As Asl 3 As AsB 3 Al AlfZ 3 Al AleZ 3 Al AldZ 3 Al AlcZ 3 Al AlcY 3 Al AlbZ 3 Al AlbY 3 Al AlaZ 3 Al AlaY 3 Al AlMn 3 Nb nb 2 K k 2 Zr ZrSi 2 Zr ZrC 2 Zn ZnD 2 Zn ZnB 2 Yb YbNa 2 O Watd 2 O WatY 2 O WatN 2 O WatK 2 O WatBb 2 O WatAIII 2 W WO 2 W WB 2 V VO 2 V VMo 2 V VB 2 Tl TlE 2 Tl TlD 2 Ti Tia 2 Ti TiO 2 Th ThB 2 Ta TaD 2 Th TH 2 Tb TB 2 Sr Srz 2 Sr SrF 2 Sr SrCa 2 Sr SrBP 2 S SpHS 2 Sn SnY 2 Sn SnD 2 Si SiTD 2 Si SiTC 2 Si SiS 2 Sc ScX 2 Sb SbT 2 Sb SbIV 2 Sb SbIII 2 Sb SbII 2 Sb SbI 2 Sm SM 2 Si SIV 2 Si SIII 2 Si SII 2 Rh RhII 2 Rh RhI 2 Re ReM 2 Rb RbMi 2 Pd Pdp 2 P PP 2 Pm PMo 2 O Owat 2 O Odozo 2 O Odozi 2 O Odooo 2 O Odooi 2 O Odmzo 2 O Odmzi 2 O Odmoo 2 O Odmoi 2 O Ocozo 2 O Ocozi 2 O Ocooo 2 O Ocooi 2 O Ocmzo 2 O Ocmzi 2 O Ocmoo 2 O Ocmoi 2 O Obozo 2 O Obozi 2 O Obooo 2 O Obooi 2 O Obmzo 2 O Obmzi 2 O Obmoo 2 O Obmoi 2 O OZ 2 O OWD 2 O OVIII 2 O OVII 2 O OOF 2 O ODmz 2 O OCmz 2 O OBmz 2 Ni NiMT 2 Ni NiMO 2 Nd NdX 2 Nd NdP 2 Na Nab 2 Na NaW 2 Na NaMi 2 Na NaMe 2 Na NaIII 2 Na NaCH 2 Na NaCA 2 Na NaAI 2 Na NaAF 2 N NC 2 Mo Mol 2 Mn MnU 2 Mn MnOc 2 Mn MnO 2 Mn MnMII 2 Mn MnAl 2 Mg Mgt 2 Mg Mgb 2 Mg Mga 2 Mg MgOc 2 Mg MgFe 2 Mg MgF 2 Li LiZ 2 Li LiX 2 K Kext 2 K KAIII 2 K KAII 2 Hg HgCO 2 Ge Gel 2 Ge GeY 2 Gd GdX 2 Gd GdAP 2 F Fphi 2 F Fo 2 Fe Fet 2 Fe FeU 2 Fe FeMn 2 Fe FeG 2 F FZ 2 F FOH 2 F FOF 2 F FF 2 F FA 2 Eu EuM 2 Dy DyX 2 Cu CuOT 2 Cu CuIT 2 Cs Csphi 2 Cr CrT 2 Cr CrD 2 Cr CrA 2 Cl Clm 2 Cl ClZ 2 Ce CeX 2 Ce CeN 2 Ca Caz 2 Ca Cax 2 Ca CaSr 2 Ca CaO 2 Ca CaNac 2 Ca CaN 2 Ca CaMi 2 Ca CaAT 2 Ca CaAF 2 C CT 2 C CCh 2 C CB 2 Br Brl 2 B Bl 2 Bi Bib 2 Bi BiA 2 Be BeB 2 Ba Bal 2 Ba BaD 2 Ba BaBa 2 Ba BaBP 2 As AsP 2 As AsII 2 As AsI 2 Al Alt 2 Al AlTC 2 Al AlSi 2 Al AlS 2 Al AlOc 2 Al AlO 2 Al AlMl 2 Al AlG 2 Ag AgMl 2 As AS 2 V v 1 Ta ta 1 Sn sn 1 Sc sc 1 Sb sb 1 Re re 1 Zr Zrm 1 Zr ZrP 1 Zr ZrNb 1 Zr ZrFe 1 Zn ZnX 1 Zn ZnO 1 Zn ZnII 1 Zn ZnI 1 ? 
Z 1 Yb YbY 1 Yb YbX 1 Y Ya 1 Y YNa 1 Y YN 1 Y YMl 1 Y YIII 1 Y YII 1 Y YI 1 Y YCe 1 Y YCa 1 Y YC 1 Yb YB 1 O Watlb 1 O Watla 1 O Watb 1 O WatWl 1 O WatVIII 1 O WatVII 1 O WatVI 1 O WatV 1 O WatIX 1 O WatIII 1 W WA 1 V VZ 1 U UY 1 Tm TmX 1 Tl Tll 1 Tl TlM 1 Tl TlI 1 Ti Til 1 Ti Tic 1 Ti Tib 1 Ti TiTi 1 Ti TiSi 1 Ti TiP 1 Ti TiFe 1 Th ThD 1 Tb TbA 1 Ta TaMO 1 Te TE 1 Tc TC 1 Sr Sri 1 Sr SrK 1 Sr SrD 1 Sn SnTi 1 Sm SmX 1 Si Sio 1 Si Silo 1 Si Silm 1 Si SiTdC 1 Si SiTdB 1 Si SiTdA 1 Si SiNb 1 Si SiBed 1 Si SiBec 1 Si SiBeb 1 Si SiBea 1 Se Sej 1 Se Sei 1 Se Sef 1 Sc ScT 1 Sc ScB 1 Sb Sbb 1 Sb Sba 1 S Sa 1 S STe 1 Ru RuM 1 Rh RhM 1 Rh RhB 1 Rh RhA 1 Re Rel 1 Pt PtL 1 Pr PrREE 1 Pr PrNa 1 Pr PrCad 1 Pr PrCac 1 Pr PrCab 1 Pr PrCaa 1 Pd Pdm 1 Pd Pde 1 Pd Pdd 1 Pd PdB 1 Pd PdA 1 Pb Pbb 1 Pb PbBa 1 Pb PbB 1 Pm PM 1 P PG 1 Pa PAs 1 Os OsM 1 O Or 1 O Ollb 1 O Oll 1 O Ola 1 O OlB 1 O Of 1 O Oeq 1 O OY 1 Os OSl 1 O OO 1 O OK 1 O OIIIB 1 O OIIIA 1 O OFd 1 O OFc 1 O OFb 1 O OFa 1 O ODzi 1 O OCzi 1 O OCl 1 O OCC 1 O OCB 1 O OCA 1 O OBzi 1 Ni Nil 1 Ni NiY 1 Ni NiX 1 Ni NiB 1 Nd NdZ 1 Nd NdREE 1 Nd NdCad 1 Nd NdCac 1 Nd NdCab 1 Nd NdCaa 1 Nb Nba 1 Na Namid 1 Na Nam 1 Na Naext 1 Na Nad 1 Na Nac 1 Na Naa 1 Na NaD 1 Na NaCl 1 Na NaCho 1 Na NaCal 1 Na NaBa 1 Na NaBA 1 Na NaAII 1 Ni NIA 1 N NCh 1 Mo MoT 1 Mn Mnx 1 Mn Mnl 1 Mn MnS 1 Mn MnOcb 1 Mn MnOca 1 Mn MnMla 1 Mn MnMl 1 Mn MnK 1 Mn MnG 1 Mg Mgx 1 Mg MgMg 1 Mg MgMII 1 Mg MgIII 1 Mg MgCu 1 Mg MgCA 1 Mg MgBA 1 Lu LuX 1 Li Lil 1 Li LiD 1 Li LiC 1 Li LiB 1 Li LiAm 1 La LaX 1 La LaREE 1 La LaRE 1 La LaN 1 La LaMl 1 La LaCad 1 La LaCac 1 La LaCab 1 La LaCaa 1 K Kphi 1 K Kmid 1 K Kint 1 K KWl 1 K KW 1 K KOH 1 K KMe 1 K KCl 1 K KCA 1 K KBA 1 Ir IrM 1 H Hwl 1 Ho HoX 1 Ho HoA 1 Hg Hgl 1 Hg HgOH 1 Hf HfY 1 Hf HfB 1 H HZ 1 Hs HS 1 Ho HOh 1 Ge GeGa 1 Ga GaZ 1 Ga GaY 1 F Fw 1 F Fl 1 Fe Fey 1 Fe Fex 1 Fe Fed 1 Fe Fec 1 Fe FeS 1 Fe FeMla 1 Fe FeMl 1 Fe FeMg 1 Fe FeMc 1 Fe FeMII 1 Fe FeIII 1 Fe FeIIB 1 Fe FeIIA 1 Fe FeCu 1 F Fd 1 F Fc 1 F Fb 1 F Fa 1 F 
FII 1 F FI 1 F FD 1 F FB 1 Eu EuA 1 Er Erl 1 Er ErX 1 Dy Dya 1 Dy DyN 1 Dy DyMH 1 Dy DY 1 Cu Cub 1 Cu CuOH 1 Cu CuMg 1 Cu CuCO 1 Cs Csl 1 Cs CsY 1 Cr Cry 1 Cr CrO 1 Co Col 1 Co CoB 1 Co CoA 1 Cl Clll 1 Cl Cllb 1 Cl Clla 1 Cl Clhcl 1 Cl Clb 1 Cl Cla 1 Cl ClAC 1 Ce Cen 1 Ce CeY 1 Ce CeREE 1 Ce CeRE 1 Ce CeMl 1 Ce CeCad 1 Ce CeCac 1 Ce CeCab 1 Ce CeCaa 1 Ce CeC 1 Cd CdX 1 Cd CdF 1 Cd CdE 1 Cd CdD 1 Cd CdC 1 C Cb 1 Ca Cai 1 Ca Cad 1 Ca CaREE 1 Ca CaNab 1 Ca CaNaa 1 Ca CaNA 1 Ca CaMn 1 Ca CaMla 1 Ca CaMl 1 Ca CaIII 1 Ca CaFe 1 Ca CaE 1 Ca CaCl 1 Ca CaCa 1 Ca CaCA 1 Ca CaAl 1 Ca CaAM 1 C CX 1 C CTA 1 Cm CMe 1 C CG 1 C CC 1 Br Brll 1 Br BrB 1 Br BrA 1 Bi Bil 1 Bi BiMl 1 Be Bed 1 Be Bec 1 Be Beb 1 Be Bea 1 Be BeA 1 Ba BaI 1 Ba BaE 1 Ba BaCl 1 Ba BaCa 1 Ba BaAIII 1 Bi BII 1 As AsSb 1 As AsC 1 As AsAs 1 As AsA 1 Ar ArCh 1 Al Aly 1 Al Aloc 1 Al Allo 1 Al Allm 1 Al AlTD 1 Al AlMg 1 Al AlMc 1 Al AlII 1 Al AlI 1 Al AlFe 1 Al AlC 1 Al AlAl 1 Ag Agl 1 Ag AgX 1 Ag AgME 1 Ag AgII 1 Ag AgI 1 Ag AgA 1

From thomas at tdsonline.nl Tue Jan 14 11:33:52 2020
From: thomas at tdsonline.nl (Thomas Dortmann)
Date: Tue, 14 Jan 2020 10:33:52 +0100
Subject: [Cod-bugs] COD conversion with HighScore
In-Reply-To: <98d13001-c538-8374-cf22-9d094d40cc51@ibt.lt>
References: <302131fd-91c5-7fde-1425-25711c20462d@ibt.lt> <24528fa0-cec1-0350-82c9-394845bc9b39@ibt.lt> <98d13001-c538-8374-cf22-9d094d40cc51@ibt.lt>
Message-ID:

Hi Saulius,

thanks a lot for your answer! The COD conversion runs nearly 100% now - for the water (oxygen) molecules addressed by us. There are many more problematic atom_names, as shown in your lists. Our heuristics are slightly different from yours and do interpret many of the problematic names too, but whether this interpretation is right or wrong nobody knows.
Next to the lists provided in your email, we also encounter many atom_names such as "?", "n", a single dot, or an empty space, which we do not interpret at all.

I fully agree with your plans a) and b)! In the end, no heuristic is error-proof. It will be a big step forward and a great improvement in COD quality when as many CIFs as possible carry atom_symbols next to the atom_names.

We will fix the "OWat(n)" atom_name and hope to have our conversion ready by the beginning of next week.

best regards,
Thomas Dortmann

On Mon, Jan 13, 2020 at 8:21 PM Saulius Gražulis wrote:

> Dear Thomas,
>
> On 2020-01-13 12:43, Thomas Dortmann wrote:
> > a) concerning atom-labels:
> >
> > we fixed the "Wat" atom-label in our conversion and now the number of
> > patterns containing the (wrong!) element Astatine is down from 1180 to 8!
> > These eight remaining patterns are all minerals, where water is coded as
> > "OWat(n)" instead of "Wat";
> > these are the corresponding CIFs:
> > 9006364 - 9006368
> > 9014246
> > 9014312
> > 9016377
>
> This is good news. I'm very happy that your conversion software runs
> nearly 100%!
>
> On my side, I went through the "uncanonical" atom names in the COD
> (http://saulius-grazulis.lt/~saulius/.d981490889b10e82e8f6943bbfd569aaebf1c8c3/).
> In the file "estimated-atom-types.lst", the first column is the
> estimated atom type, the second is the atom name in the corresponding
> CIF, and the third is the number of occurrences of this atom name.
>
> The "DOUBLE_CHECK.lst" contains a manually compiled list of atom types
> that are most probably wrong after automatic detection and will need to
> be inspected by a human.
>
> The policy I would adopt is the following:
>
> a/ If a CIF already contains _atom_site_type_symbol, we do nothing.
>
> Reason: the _atom_site_type_symbol is either added manually by COD
> curators (in this case we do not want to undo our manual work), or it is
> provided by CIF authors.
> Among the atom type symbols, the most common irregularity is symbols
> written in all lowercase or all uppercase. These can be dealt with by
> regularising the case and looking the result up in a table; e.g. we do:
>
>     ucfirst(lc($atom_type_symbol)),
>
> where lc($string) returns an all-lowercase version of the argument string,
> and ucfirst($string) returns the string with the first letter
> uppercased, yielding "Ca" from both "ca" and "CA", which is mostly
> correct. Of the 14742 atoms in the COD that have _atom_site_type_symbol
> values, only 28 could not be interpreted in this way: a negligible
> amount, and not correctable even manually.
>
> Since, as I understand, your software already incorporates this
> heuristic, atoms with _atom_site_type_symbol will not be a problem,
> will they?
>
> b/ If the atom does *not* have the _atom_site_type_symbol, we will guess
> its type from the atom label. If the leading non-digit characters of the
> atom label yield a valid periodic system element name, we do nothing.
>
> If the leading non-digit characters of the atom site label do *not*
> yield a recognisable atom name, we apply heuristics as noted in the
> estimated-atom-types.lst.log in the Web page cited above; in Perl:
>
>     $n1 = ucfirst(substr($atom_site_label,0,1));
>     $n2 = ucfirst(lc(substr($atom_site_label,0,2)));
>
>     if( $atom_site_label =~ /^Wat[A-Za-z0-9\(\)]*$/ ) {
>         $atom_site_type_symbol = "O"
>     } elsif( exists $COD::AtomProperties::atoms{$n2} ) {
>         $atom_site_type_symbol = $n2
>     } elsif( exists $COD::AtomProperties::atoms{$n1} ) {
>         $atom_site_type_symbol = $n1
>     } else {
>         $atom_site_type_symbol = "?";
>         print STDERR "$0: WARNING, atom type for atom \"$F[1]\" is not recognised\n"
>     }
>
> We then compute the summary formula with the new atom types, and compare
> it with the formula provided by the authors. If the summary formulae
> match, we add the _atom_site_type_symbol to the CIF. If not, we report
> an error.
>
> After this, we double-check the atom types mentioned in DOUBLE_CHECK.lst.
>
> The new modified CIFs will have recognisable (standard) element names in
> _atom_site_type_symbol, and will have a correct chemical formula
> computable from atom records (correct means the same as provided by the
> author). The results will be similar to those in estimated-atom-types.lst.
>
> The new CIFs may only break the heuristics in your program if:
>
> 1/ we guess the atom types wrongly,
> 2/ the authors provided an incorrect summary chemical formula,
> 3/ the two incorrect formulas match by pure accident,
> 4/ your heuristics get the atom types right.
>
> or
>
> 1'/ we make two mistakes that compensate each other exactly (e.g. Ca->C
> on one site, and C->Ca on another site) and still get the correct
> formula with incorrect atom site assignments.
>
> I regard the coincidence of these events as highly unlikely. Also, when
> detected, the _atom_site_type_symbol values can be curated manually and
> will *not* be overridden again by automatic software.
>
> If you find such a COD curation policy acceptable, we will proceed with
> its implementation at some time in the future, and add it to our automatic
> pipelines (but without the manual check stage for every incoming file...).
>
> I CC this e-mail to the COD AB for discussion and eventual policy approval.
>
> Regards,
> Saulius
>
> --
> Dr. Saulius Gražulis
> Vilnius University Institute of Biotechnology, Saulėtekio al. 7
> LT-10257 Vilnius, Lietuva (Lithuania)
> fax: (+370-5)-2234367 / phone (office): (+370-5)-2234353
> mobile: (+370-684)-49802, (+370-614)-36366
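
For readers who do not use Perl, the policy quoted above (case normalisation, the label heuristic, and the summary-formula check) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the COD implementation: the names ELEMENTS, normalise_symbol, guess_type, formula_from_labels and formula_matches are hypothetical, ELEMENTS is a tiny subset standing in for the table behind %COD::AtomProperties::atoms, and the formula check merely counts sites, ignoring the occupancies and multiplicities that a real check would have to consider.

```python
import re
from collections import Counter

# Assumption: a small illustrative subset of element symbols. A real run
# would need the full table (the quoted Perl uses %COD::AtomProperties::atoms).
ELEMENTS = {"H", "C", "N", "O", "Na", "Ca", "Al", "Si", "Fe"}

# Same pattern as in the quoted Perl: labels such as "Wat", "WatA", "Wat(1)".
WATER_LABEL = re.compile(r"Wat[A-Za-z0-9()]*")


def normalise_symbol(symbol):
    """Counterpart of Perl's ucfirst(lc($s)): 'CA' and 'ca' both give 'Ca'."""
    return symbol[:1].upper() + symbol[1:].lower()


def guess_type(label):
    """Guess an element symbol from an atom site label, or '?' if unknown."""
    if WATER_LABEL.fullmatch(label):
        return "O"
    two = normalise_symbol(label[:2])  # try a two-letter symbol first
    one = label[:1].upper()            # then fall back to one letter
    if two in ELEMENTS:
        return two
    if one in ELEMENTS:
        return one
    return "?"


def formula_from_labels(labels):
    """Count guessed atom types (simplified: one atom per site)."""
    return Counter(guess_type(label) for label in labels)


def formula_matches(labels, author_formula):
    """True if the guessed composition equals the author-provided one."""
    return formula_from_labels(labels) == Counter(author_formula)
```

With this sketch, guess_type("WatA") and guess_type("OWat1") both give "O", and guess_type("CaMO") gives "Ca". A label such as "CA1" that an author meant as carbon would be guessed as "Ca", and formula_matches(["CA1", "O1"], {"C": 1, "O": 1}) would then fail, which is the situation where the policy reports an error instead of writing the guessed type into the CIF.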