From voleinikovas at monterosatx.com Tue Nov 15 15:22:16 2022
From: voleinikovas at monterosatx.com (Vladas Oleinikovas)
Date: Tue, 15 Nov 2022 13:22:16 +0000
Subject: [Cod-bugs] Number of entries in smiles.txt do not match cif entries.
Message-ID:

Hi!

Firstly, thanks for an amazing repo and great documentation!

I have recently downloaded the COD using the command:

>wget http://www.crystallography.net/archives/cod-cifs-mysql.zip

After unzipping I found the cif and mysql directories, as expected.

Looking at the files in the mysql directory, the smiles.txt file caught my interest. It looks very useful for searching for molecules of interest, especially the organic ones that I am interested in. I assume this relates to this paper (https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0279-6), is that correct?

Counting the entries in this file, however, I find the number significantly smaller than the number of entries reported on the title page ("Currently there are 494800 entries in the COD"):

~/COD/mysql:> wc -l smiles.txt
> 219646 smiles.txt

Is this because the file is not being updated, or does it exclude entries that could not be converted into SMILES?

Many thanks for your reply!

Best wishes,
Vladas

P.S. Feel free to answer in Lithuanian, if preferred.

--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.

From antanas.vaitkus90 at gmail.com Tue Nov 15 16:12:38 2022
From: antanas.vaitkus90 at gmail.com (Antanas Vaitkus)
Date: Tue, 15 Nov 2022 16:12:38 +0200
Subject: [Cod-bugs] Number of entries in smiles.txt do not match cif entries.
In-Reply-To:
References:
Message-ID:

Dear Vladas,

On Tue, 15 Nov 2022 at 15:32, Vladas Oleinikovas <voleinikovas at monterosatx.com> wrote:

> Hi!
>
> Firstly, thanks for an amazing repo and great documentation!

It is good to hear that you find the COD useful.

> I have recently downloaded the COD using the command:
>
> >wget http://www.crystallography.net/archives/cod-cifs-mysql.zip
>
> After unzipping I found the cif and mysql directories, as expected.
>
> Looking at the files in the mysql directory, the smiles.txt file caught my interest. It looks very useful for searching for molecules of interest, especially the organic ones that I am interested in. I assume this relates to this paper (https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0279-6), is that correct?

Yes, the paper describes the overall workflow used to create the SMILES strings as well as the conventions employed to represent various compounds which do not fit well in the bond valence model that the SMILES format is based on.

> Counting the entries in this file, however, I find the number significantly smaller than the number of entries reported on the title page ("Currently there are 494800 entries in the COD"):
>
> ~/COD/mysql:> wc -l smiles.txt
> > 219646 smiles.txt
>
> Is this because the file is not being updated, or does it exclude entries that could not be converted into SMILES?

Since the COD SMILES strings are generated semi-manually by one of our volunteer chemists (for more details see the paper you cited earlier), the overall process is quite slow. The SMILES dataset is still routinely updated and hopefully will eventually cover a more significant part of the COD.

We are also working on a more automated approach for deriving chemical descriptions from crystallographic data (CIF -> SMILES, SDF, DWAR, etc.) which will provide an alternative way of searching for chemical compounds in the COD. The manuscript is still in preparation, but I can send you a link to the paper once it is published, if you are interested.

> Many thanks for your reply!

Hopefully this answers your question. Please let me know if you have any further questions or comments.

> Best wishes,
> Vladas
>
> P.S.
> Feel free to answer in Lithuanian, if preferred.

I do prefer Lithuanian, but decided to reply in English in case I need to answer the same question for non-Lithuanian speakers in the future.

Sincerely,
Antanas Vaitkus

> _______________________________________________
> Cod-bugs mailing list
> Cod-bugs at lists.crystallography.net
> http://lists.crystallography.net/cgi-bin/mailman/listinfo/cod-bugs

--
Antanas Vaitkus,
Vilnius University,
Life Sciences Center,
Institute of Biotechnology, room C521,
Saulėtekio al. 7, LT-10257 Vilnius, Lithuania

From voleinikovas at monterosatx.com Tue Nov 15 16:16:05 2022
From: voleinikovas at monterosatx.com (Vladas Oleinikovas)
Date: Tue, 15 Nov 2022 14:16:05 +0000
Subject: [Cod-bugs] Number of entries in smiles.txt do not match cif entries.
In-Reply-To:
References:
Message-ID:

Hello,

Thank you for the detailed answer. I look forward to news about the new publication.

See you soon,
Vladas

From grazulis at ibt.lt Tue Nov 15 17:24:33 2022
From: grazulis at ibt.lt (Saulius Gražulis)
Date: Tue, 15 Nov 2022 17:24:33 +0200
Subject: [Cod-bugs] {Disarmed} Deposition of our prepublication XRD data to COD
In-Reply-To:
References:
Message-ID:

Dear Milad,

thank you for your interest in the COD. Yes, I am aware of the situation in Iran; I am greatly dismayed to hear that Iran blocks Internet access and cracks down on its citizens.

I regret to tell you that the files you have sent us are not suitable for the COD database (and this is the reason why they failed the validation). The COD accepts only structures with atomic coordinates in CIF format [1,2]. As I understood from the files that you have sent us, you conducted a powder diffraction experiment. To get atomic coordinates from such data, phasing, modelling and Rietveld refinement of the structure should be done. The guidelines of the IUCr procedures should be followed [3]. The COD does not accept plain powder diffraction traces for deposition.

To deposit just a powder diffraction trace in a public repository you may consider uploading your files to Zenodo or Data Dryad. But even then you need to add a great deal of metadata to your diffraction data.
As a bare minimum, you need to specify what your sample was and how it was obtained; the model and type of your measurement instrument; the instrument settings (wavelength, X-ray tube voltage and current, slit settings, detector type, monochromator type, etc.); and when, where and by whom the data were collected. For administrative purposes, you may want to specify the address of the institution where the measurement was performed and the funding source. Needless to say, the data must be in some standardised format (e.g. CIF); the original files from the instrument should also be provided. Plain Matlab files are not the best way to distribute your measurements.

Hope this helps...

Sincerely,
Saulius

Refs.:

[1] Hall, S. R.; Allen, F. H. & Brown, I. D. The crystallographic information file (CIF): a new standard archive file for crystallography. Acta Crystallographica Section A, 1991, 47, 655-685, DOI: https://doi.org/10.1107/S010876739101067X

[2] IUCr. Crystallographic Information Framework. 2022, URL: https://www.iucr.org/resources/cif; accessed 2022-11-15T14:28+02:00.

[3] David, W. I. F.; Shankland, K.; McCusker, L. B. & Baerlocher, C. Structure Determination from Powder Diffraction Data. Oxford University Press, 2002, URL: https://isbnsearch.org/isbn/0199205531 .

On 2022-11-07 19:06, Milad Rasouli wrote:
> Dear Sir or Madam,
>
> I hope you are well!
>
> I am writing to you regarding the deposition of our prepublication data to COD.
>
> I am trying to add our XRD data to the Crystallography Open Database, but I get an error when I try to validate the data.
>
> Please find enclosed the proposed file for deposition in the Crystallography Open Database.
>
> As you may know, Iran's government blocked internet access as the protests grew. Therefore, we have limited access to the internet.
>
> Could you please let me know what I should do?
>
> This step should be completed as part of the process of publishing our article in Scientific Reports.
> Therefore, your kind response at your earliest convenience will be greatly appreciated!
>
> All the best,
> Milad Rasouli

--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366

From grazulis at ibt.lt Tue Nov 15 17:56:14 2022
From: grazulis at ibt.lt (Saulius Gražulis)
Date: Tue, 15 Nov 2022 17:56:14 +0200
Subject: [Cod-bugs] cif corrections
In-Reply-To:
References:
Message-ID: <3769e404-73a3-4b73-5e44-8ffd7966ec4a@ibt.lt>

Dear William,

thank you very much for your e-mail! Your reports are very valuable, because they show us what the requirements and priorities of COD users are.

The problems that you have identified are indeed problems with the COD data, mostly stemming from the original publications. These issues, as well as some other ones, can be detected by using CIF validation software [1]. The problem is that there is a large number (more than 11 million) of these validation issues [2], so we cannot fix them on the spot; we are slowly going through the COD one file at a time and fixing them, quite often manually.
In many cases we need to consult the original publications to see what the authors' intent really was, since the main principle of the COD is "do not invent data".

It would be great if you could send us your processing logs with all entries; we would use them as a guide to which types of errors to treat first, and I would also be interested to see whether you catch more errors than we do. So yes, please send us the remaining entry list if that is possible. Of course I cannot promise that we will fix them immediately, but at least we will put them at the top of our priority queue ;).

Also, some problems might be impossible to fix; e.g. the COD entry 2005961 indeed has a broken _atom_site_aniso list, but the same problem is also present in the original supplementary data, and the paper itself does not contain a Uij list (as far as I could see). In this case we can only mark the entry as "having problems" and suggest that users rely only on the XYZ coordinates and Uiso values instead; or try to contact the authors and ask them for a correction, or for Fobs data so that the Uij values can be refined anew.

Sincerely yours,
Saulius

Refs.:

[1] Vaitkus, A. Validation messages for the COD entry 1506432. URL: https://sql.crystallography.net/db/cod_validation/validation_issue?offset=0&rows=100&filter=%28cod_id%20%3D%20%221506432%22%29

[2] Vaitkus, A. COD validation issue database. 2021, URL: https://sql.crystallography.net/db/cod_validation [accessed 2022-11-15T17:39+02:00]. NOTE: the page is slow to load, please be patient!

On 2022-11-07 20:47, William Lenthe wrote:
> I read through the first 100k entries of the COD to test some cif parsing code and believe I found a few errors
>
> 1506432: line 186 >F40 should be F40
>
> 1506503: line 200 >F1' should be F1'
>
> 2005854: lines 160-162 are human readable but non-standard (should be split into 2 lines each or maybe the more common As1/P1)
>
> 2005923: line 176 _atom_site_aniso_label is '5' (maybe it should be O5?)
>
> 2005926: line 179 label is 04 instead of O4 (number zero instead of letter O)
>
> 2005961: _atom_site_aniso_label loop appears to be malformed
>
> I generated another ~75 warnings if useful, they are mostly issues with case consistency or atoms listed in _atom_site_aniso_label that don't appear in _atom_site_label

--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366

From sam.mcdonald at duke.edu Tue Nov 15 19:44:03 2022
From: sam.mcdonald at duke.edu (Sam McDonald)
Date: Tue, 15 Nov 2022 17:44:03 +0000
Subject: [Cod-bugs] Question about Parameters
Message-ID:

Hi!

I am working with the csv of your data and was wondering if you could tell me what: Rall, Robs, Rref, wRall, wRobs, wRref, RFsqd, RI, gofref, nel, sg, sgHall, and sgNumber are?

I searched through your website and had trouble finding them but I am very sorry if I missed their definitions somewhere.

Thanks so much!

Sam McDonald
Ph.D. Candidate | Duke University
Matthew Becker's Lab | Department of Chemistry

From antanas.vaitkus90 at gmail.com Tue Nov 15 20:51:03 2022
From: antanas.vaitkus90 at gmail.com (Antanas Vaitkus)
Date: Tue, 15 Nov 2022 20:51:03 +0200
Subject: [Cod-bugs] Question about Parameters
In-Reply-To:
References:
Message-ID:

Dear Sam,

The fields in the CSV file correspond to the fields of the 'data' table from the COD SQL database.
The 'data' table fields are described in detail in the following XML file:
https://www.crystallography.net/cod/xml/documents/database-description/database-description.xml .

Please let me know if this sufficiently answers your question or if you have any further comments.

Sincerely,
Antanas

On Tue, 15 Nov 2022 at 20:31, Sam McDonald wrote:
> Hi!
>
> I am working with the csv of your data and was wondering if you could tell me what: Rall, Robs, Rref, wRall, wRobs, wRref, RFsqd, RI, gofref, nel, sg, sgHall, and sgNumber are?
>
> I searched through your website and had trouble finding them but I am very sorry if I missed their definitions somewhere.
>
> Thanks so much!
>
> Sam McDonald
> Ph.D. Candidate | Duke University
> Matthew Becker's Lab | Department of Chemistry

--
Antanas Vaitkus,
Vilnius University,
Life Sciences Center,
Institute of Biotechnology, room C521,
Saulėtekio al. 7, LT-10257 Vilnius, Lithuania

From andrius.merkys at gmail.com Tue Nov 15 21:05:59 2022
From: andrius.merkys at gmail.com (Andrius Merkys)
Date: Tue, 15 Nov 2022 21:05:59 +0200
Subject: [Cod-bugs] Question about Parameters
In-Reply-To:
References:
Message-ID:

Hello,

On Tue, 15 Nov 2022, 20:51 Antanas Vaitkus, wrote:
> The fields in the CSV file correspond to the fields of the 'data' table from the COD SQL database.
The 'data' table fields are described in detail > in the following XML file: > https://www.crystallography.net/cod/xml/documents/database-description/database-description.xml > . > Just to add to Antanas's answer, there is a bit more human-readable page if XML is not an option, but it is less detailed: http://wiki.crystallography.net/cod_mysql_schema/ Hope this helps, Andrius > -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. -------------- next part -------------- An HTML attachment was scrubbed... URL: From antanas.vaitkus90 at gmail.com Tue Nov 15 21:13:40 2022 From: antanas.vaitkus90 at gmail.com (Antanas Vaitkus) Date: Tue, 15 Nov 2022 21:13:40 +0200 Subject: [Cod-bugs] Question about Parameters In-Reply-To: References: Message-ID: Hi, On Tue, 15 Nov 2022 at 21:06, Andrius Merkys wrote: > Hello, > > On Tue, 15 Nov 2022, 20:51 Antanas Vaitkus, > wrote: > >> The fields in the CSV file correspond to the fields of the 'data' table >> from the COD SQL database. The 'data' table fields are described in detail >> in the following XML file: >> https://www.crystallography.net/cod/xml/documents/database-description/database-description.xml >> . >> > > Just to add to Antanas's answer, there is a bit more human-readable page > if XML is not an option, but it is less detailed: > > http://wiki.crystallography.net/cod_mysql_schema/ > I also thought of linking it, but it seems that we have not properly updated this website for some time now, hence the Rall, Robs, Rref, etc. fields are currently not described there at all. Sincerely, Antanas > -- > This message has been scanned for viruses and > dangerous content by *MailScanner* , and is > believed to be clean. 

--
Antanas Vaitkus,
Vilnius University,
Life Sciences Center,
Institute of Biotechnology, room C521,
Saulėtekio al. 7, LT-10257 Vilnius, Lithuania

From andrius.merkys at gmail.com Wed Nov 16 16:33:17 2022
From: andrius.merkys at gmail.com (Andrius Merkys)
Date: Wed, 16 Nov 2022 16:33:17 +0200
Subject: [Cod-bugs] Question about Parameters
In-Reply-To:
References:
Message-ID: <108827b5-c0d8-a0ed-0154-2539eaac1282@gmail.com>

Hi,

On 2022-11-15 21:13, Antanas Vaitkus wrote:
> I also thought of linking it, but it seems that we have not properly updated this website for some time now, hence the Rall, Robs, Rref, etc. fields are currently not described there at all.

Right, I also thought that it might be outdated. Thus the XML file is closer to what we have in the SQL/CSV.

Thanks,
Andrius

--
Andrius Merkys
Vilnius University Institute of Biotechnology, Saulėtekio al. 7
LT-10257 Vilnius, Lithuania

From grazulis at ibt.lt Fri Nov 18 19:23:42 2022
From: grazulis at ibt.lt (Saulius Gražulis)
Date: Fri, 18 Nov 2022 19:23:42 +0200
Subject: [Cod-bugs] cif corrections
In-Reply-To:
References: <3769e404-73a3-4b73-5e44-8ffd7966ec4a@ibt.lt>
Message-ID: <727e4d6b-76d0-3739-0019-d377bf843736@ibt.lt>

Dear William,

thank you very much for the answer and for the COD processing logs! They are indeed very useful for us.

Below, I give some comments on the kinds of issues your software has found.
Some are hard to fix and reflect the real situation in the crystal, or the choice and opinion of the structure author. Some, however, are serious problems which we missed and which we will have to investigate. Details below:

On 2022-11-17 18:13, William Lenthe wrote:
> Please find a list of the errors generated by my parsing of the database. I've manually removed a few types of errors that are due to limitations in my code. For example I don't handle modulated structures (I count 285 in the database)

I found 287 in the recent revision of the COD, which is probably the same set as yours plus a few new structures. Confirmed.

> or support cifs without a space group (15), lattice parameters (10), and at least 1 atom (there are so many of these I didn't count). I've also removed 286 instances of an error message like:
>
> 1501515.cif: _atom_type_symbol 'Ti4+' not found in _atom_site_label loop

I get similar counts; confirmed. The _atom_type_symbol values need to be fixed systematically, somehow, but there are 115772 such messages, so they are hard to go through manually. The '_atom_site_type_symbol' value Ti5+ seems to be a mismatch with the '_atom_type_symbol' value 'Ti4+'; this is probably a mistake, but we need to check the original paper before correcting it.

> Since it looks like the database you send already catches that type. I left in the case sensitive versions since they should be easy fixes. Finally also have a list of ~35k warnings that are almost all one of these types:
>
> 1006173.cif: clamped site O2 occupancy from 1.005(5) to 1.

Here, the occupancies seem to have been refined. Both occupancies are within the error margins of 1.0. The occupancies of oxygens O1 and O2 add up to 2.0, so clamping one without adjusting the other will give a slightly incorrect total oxygen count in the structure. Whether this is an issue will depend on your application, of course...
We'll leave these values as they are in the COD since this is how the authors reported the structure; if they refined the occupancies, we need to leave an indication that this was done.

> 1000495.cif: merging equivalent positions to 0.365000 0.365000 0.375000 with total occupancy 0.25 from sites labeled {Cs1, Cs2}.

Specifying multiple atoms at exactly the same site seems to be an accepted way to represent occupational disorder. I would say this is a feature, not a bug ;)

> 1008070.cif: space group is triclinic but lattice constants are orthorhombic.

Monoclinic space groups can have (nearly) any angles, including 90.0 degree angles. It would of course be strange to see a monoclinic cell with all right angles just by accident, but in this case the abstract of the paper [2] says:

"For x > 0.9, these compounds have an orthorhombic symmetry (O) if the cations are disordered, while the symmetry lowers to monoclinic (M?) if the cations are ordered"

Thus, there is either a reinterpretation of orthorhombic data as monoclinic, or a transition between an ordered and a disordered phase here, which does not change the cell angles. Thus I would say the angles are legit.

> 1100066.cif: corrected trigonal/hexagonal unequal a/b from 9.048(1)/9.047673 to 9.047836.

The cell dimensions are within the error margin of each other, so we will probably leave them as the authors have reported them. Of course your software is absolutely correct to merge the values if that is needed for your application.

> 1503454.cif: corrected monoclinic b alpha from 89.990(6) to 90. corrected monoclinic b gamma from 89.995(6) to 90.

Again, the angles are within the specified error margins, and probably were refined (not fixed), so we leave them as they are in the COD.
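
The "within the error margin" comparisons above can be sketched in a few lines of Python. This is only an illustration, not code from the COD tools or from William's parser: the names parse_cif_number and within_su and the n_sigma parameter are hypothetical, and the regular expression handles only plain decimal values with an optional parenthesised standard uncertainty (no exponents).

```python
import re
from decimal import Decimal

# Plain decimal number with an optional standard uncertainty (s.u.) in
# parentheses, e.g. "89.990(6)". Exponent notation is deliberately not handled.
_CIF_NUMBER = re.compile(r"^([+-]?\d+(?:\.\d*)?|[+-]?\.\d+)(?:\((\d+)\))?$")


def parse_cif_number(text):
    """Return (value, su) as Decimals; su is None when no s.u. is given."""
    match = _CIF_NUMBER.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple CIF number: {text!r}")
    mantissa, su_digits = match.groups()
    value = Decimal(mantissa)
    if su_digits is None:
        return value, None
    # The s.u. digits refer to the last quoted decimal places of the value.
    decimals = len(mantissa.split(".")[1]) if "." in mantissa else 0
    return value, Decimal(su_digits).scaleb(-decimals)


def within_su(text, ideal, n_sigma=1):
    """True if the value differs from `ideal` by at most n_sigma s.u."""
    value, su = parse_cif_number(text)
    difference = abs(value - Decimal(str(ideal)))
    if su is None:
        return difference == 0
    return difference <= Decimal(str(n_sigma)) * su
```

With these definitions, within_su("89.995(6)", 90) and within_su("1.005(5)", 1) hold at one standard uncertainty, while 89.990(6) needs n_sigma=2 to compare equal to 90; which threshold is appropriate depends on the application.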

> The file is a ~11mb so I'll send it using a file service (hightail) instead of as an attachment

Thanks, I have stored it in our private repository, and we will consult the file when we have an automatic procedure to fix some of the issues, or when we look for the issues that can be fixed manually...

This is the catalogue of messages that I have extracted:

> saulius at tasmanijos-velnias 2022-11-17/ $ awk -F: '{print substr($2,2,12)}' cod_wrn.txt | sort | uniq -c | sort -nr -k1,1 | cat -n
>      1    33414 merging equi
>      2      967 clamped site
>      3      277 _atom_site_a
>      4      224 space group
>      5      219 no space gro
>      6       86 corrected tr
>      7       84 corrected mo
>      8       74 corrected te
>      9       51 corrected or
>     10       22 corrected he
>     11       21 corrected cu
>     12        1 corrected rh

Below, I analyse all unique messages from 'cod_err_flt.txt':

> saulius at tasmanijos-velnias 2022-11-17/ $ awk -F: '{print substr($2,2,20)}' cod_err_flt.txt | sort | uniq -c | sort -nr -k1,1
>     124 _atom_site_aniso_lab
>      51 failed to unambiguou
>      30 cif block has confli
>      17 failed to parse spac
>       5 cif block contains l
>       3 cif block tag '_chem
>       1 monoclinic groups mu
>       1 line 86 isn't commen
>       1 line 55212 isn't com
>       1 line 490 isn't comme
>       1 line 132 isn't comme
>       1 line 128 isn't comme
>       1 cif block loop has m
>       1 cif block has loop r

The '_atom_site_aniso_lab' message is a genuine validation warning; we'll triage these and probably fix them in due time, if it is possible to fix them at all.

The 'failed to unambiguously determine space group setting' message is correct; there seem to be neither symmetry operators nor a Hall symbol in these files. International Tables would imply a default setting, but this might be dangerous to assume.
Probably we cannot fix this unless the authors confirm the setting they used, or the setting is recorded in the paper.

The 'failed to parse space group from string' message is correct; these are either incorrectly recorded space groups from the original CIFs (which we probably cannot fix), or modulated structures which have a superspace group name instead of the space group name; these we can probably fix some day, when we have code for reducing a superspace group to a space group integrated into the COD pipeline.

The following finding is quite worrying for me:

> saulius at tasmanijos-velnias 2022-11-17/ $ grep 'cif block has conflicted hall symbol' cod_err_flt.txt | head -5
> 1010928.cif: cif block has conflicted hall symbol (-P 3* 2n) and space group operators (recovered p_3*_2_-1n)
> 1010956.cif: cif block has conflicted hall symbol (-P 2n 2a) and space group operators (recovered p_2bc_2ac_-1ac)
> 1010962.cif: cif block has conflicted hall symbol (-P 3* 2n) and space group operators (recovered p_3*_2_-1n)
> 1011149.cif: cif block has conflicted hall symbol (-P 2n 2a) and space group operators (recovered p_2bc_2a_-1a)
> 2002944.cif: cif block has conflicted hall symbol (-P 4 2ab) and space group operators (recovered p_4_2ab)

Indeed, the Hall symbols and the symmetry operators in the structures do not match (30 cases). We'll have to look at the original publications to find out why this is so. We'll add code to check the correspondence between symmetry operators and Hall symbols to our COD check routines. Thanks for pointing this out!

> saulius at tasmanijos-velnias 2022-11-17/ $ grep ': line' cod_err_flt.txt
> 4029286.cif: line 490 isn't comment or part of loop row of cif but doesn't have _
> 7223602.cif: line 86 isn't comment or part of loop row of cif but doesn't have _
> 7228312.cif: line 128 isn't comment or part of loop row of cif but doesn't have _
> 7238658.cif: line 132 isn't comment or part of loop row of cif but doesn't have _
> 7705257.cif: line 55212 isn't comment or part of loop row of cif but doesn't have _

It is not quite clear what this error message is saying, but yes, all these lines are neither comments nor parts of loops; they are parts of multi-line text fields delimited by ';' tokens and can contain arbitrary text (well, nearly arbitrary). Could it be that your CIF parser misses the beginning of a text field? Of these files, 7705257.cif contains a garbled HKL Fobs reflection list; the rest seem OK.

> saulius at tasmanijos-velnias 2022-11-17/ $ grep 'cif block contains l' cod_err_flt.txt
> 4002451.cif: cif block contains loop_ not followed by any tags at line 59
> 4130765.cif: cif block contains loop_ not followed by any tags at line 943
> 7035327.cif: cif block contains loop_ not followed by any tags at line 6035
> 7035331.cif: cif block contains loop_ not followed by any tags at line 6694
> 7035332.cif: cif block contains loop_ not followed by any tags at line 6802

These files (like all COD files) are syntactically OK and tags do follow the 'loop_' token. Could it be that your parser fails to discard spaces at the beginning of the line before the tags?
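
The two parser pitfalls suspected above (lines inside ';'-delimited text fields, and tags preceded by leading spaces) can be illustrated with a small line classifier. This is a simplified sketch, not the COD parser and not William's code: classify_lines and its kind labels are hypothetical, and it assumes at most one CIF construct per physical line, which real CIF files do not guarantee.

```python
def classify_lines(lines):
    """Yield (line_number, kind) for each line of a CIF-like text.

    kind is one of "text", "blank", "comment", "loop", "tag" or "value".
    """
    in_text_field = False
    for number, line in enumerate(lines, start=1):
        if line.startswith(";"):
            # A semicolon in the first column opens or closes a text field.
            in_text_field = not in_text_field
            kind = "text"
        elif in_text_field:
            # Anything inside a text field is data, whatever it looks like.
            kind = "text"
        else:
            # Leading whitespace is not significant outside text fields.
            stripped = line.strip()
            if not stripped:
                kind = "blank"
            elif stripped.startswith("#"):
                kind = "comment"
            elif stripped.lower().startswith("loop_"):
                kind = "loop"
            elif stripped.startswith("_"):
                kind = "tag"
            else:
                kind = "value"
        yield number, kind
```

Here an indented "  _atom_site_label" line after "loop_" is still reported as a tag, and free text between two ';' lines is reported as "text" even though it contains no underscore.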

> saulius at tasmanijos-velnias 2022-11-17/ $ grep "cif block tag '_chem" cod_err_flt.txt
> 4034776.cif: cif block tag '_chemical_name_systematic' followed by new line and quoted string at 36 but quoted string doesn't close or fill entire line
> 7201872.cif: cif block tag '_chemical_name_common' followed by new line and quoted string at 41 but quoted string doesn't close or fill entire line
> 7233594.cif: cif block tag '_chemical_name_systematic' followed by new line and quoted string at 37 but quoted string doesn't close or fill entire line

The situation with these files is funny. Syntactically, they are correct: the CIF syntax [1] permits any non-blank character as the trailing character of an unquoted string, including a single quote ("'")! So your parser misleads us: there is no quoted string in these cases, but an unquoted string that ends with a quote (which is a part of the value).

However, the chemical names in all these files seem to have a superfluous trailing quote that should not be included in the name. We should curate these entries according to the chemical names given in the respective papers. Our CIF checker (cif_cod_check) could probably issue a warning when such strange chemical names are encountered...

> saulius at tasmanijos-velnias 2022-11-17/ $ grep 'cif block loop has m' cod_err_flt.txt
> 4301644.cif: cif block loop has multi line delimeter token mid line at line 134

Again, this is OK syntactically but does not represent the intended data. CIF syntax is weird... Another check needed for 'cif_cod_check'? I have fixed the entry in the COD :).
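
A warning for the chemical names with a superfluous trailing quote, as discussed above, could be based on a check along the following lines. This is a hypothetical sketch and not part of cif_cod_check: the function name has_dangling_quote is invented for illustration, and it assumes the value has already been parsed, so that properly quoted strings arrive without their delimiting quotes.

```python
def has_dangling_quote(value):
    """True if a parsed value ends with a quote character it does not start with.

    Intended only for free-text items such as chemical names; atom labels
    like F1' legitimately end with a prime and should not be checked this way.
    """
    return len(value) > 1 and value[-1] in "'\"" and value[0] != value[-1]


# Hypothetical parsed values, used only to show the behaviour.
for name in ["sodium chloride'", "2,2'-bipyridine"]:
    print(name, has_dangling_quote(name))
```

A name that merely contains a prime, such as 2,2'-bipyridine, is not flagged; only values whose last character is an unmatched quote are reported, and a curator would still need to compare them with the original paper.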

> saulius at tasmanijos-velnias 2022-11-17/ $ grep 'cif block has loop r' cod_err_flt.txt
> 4342694.cif: cif block has loop row with 4 columns at line 153 but loop has 2 column headers

This is a false alarm; the loop is perfectly OK:

> saulius at tasmanijos-velnias 2022-11-17/ $ sed -n '152,153p' $(codid2file 4342694)
> _space_group_symop_operation_xyz
> 1 x,y,z 2 -x,-y,-z

Typing two data packets ('loop rows') on one physical line is perfectly OK in CIF.

Regards,
Saulius

Refs.:

[1] IUCr. CIF v1.1 File Syntax. URL: https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#gram [accessed 2022-11-18T18:23+02:00].

[2] Muller, J.; Joubert, J. C. & Marezio, M. Etude des phases du système FeVO4-VO2, obtenues par synthèse hydrothermale à 70 kbar et 1000 °C. Journal of Solid State Chemistry, Elsevier BV, 1976, 18, 357-362, DOI: https://doi.org/10.1016/0022-4596(76)90118-3

Sincerely yours,
Saulius

--
Dr. Saulius Gražulis
Vilnius University, Life Science Center, Institute of Biotechnology
Saulėtekio al. 7, LT-10257 Vilnius, Lietuva (Lithuania)
phone (office): (+370-5)-2234353, mobile: (+370-684)-49802, (+370-614)-36366