<div dir="ltr"><div><div>Hello Quinny,<br><br></div>The chemical information in the COD is currently somewhat fragmented and incomplete,<br>but you should still be able to produce a sizable dataset that meets your quality<br>criteria by querying the COD SQL database.<br><br></div><div>Here is an example query that you can modify to suit your needs:<br>```<br>mysql -u cod_reader -h <a href="http://sql.crystallography.net">sql.crystallography.net</a> cod -e 'select A.`cod_id`, A.`value`, B.`chemname`, B.`commonname`, B.`formula`, B.`calcformula` from `smiles` as A join `data` as B on (A.`cod_id` = B.`file`) limit 10'<br>```</div><div><br></div><div>This command-line query returns the COD ID, SMILES, chemical names,<br>and the declared and calculated chemical formulas. The trailing limit 10 restricts the<br>output to ten rows; remove it to retrieve the full result set. A more detailed description of<br>how each database field should be interpreted (e.g. the difference between<br>chemname and commonname) is provided in the following XML file:<br><br><a href="https://www.crystallography.net/cod/xml/documents/database-description/database-description.xml">https://www.crystallography.net/cod/xml/documents/database-description/database-description.xml</a><br><br></div><div>Note that COD entries may contain incorrect chemical names or no chemical names<br>at all, since this information is currently taken from the original CIF files without further<br>validation.<br><br></div><div>Alternatively, you can download a single file with all of the COD SMILES strings from:<br><br><a href="https://www.crystallography.net/cod/smi/allcod.smi">https://www.crystallography.net/cod/smi/allcod.smi</a><br><br>Note, however, that this file contains only the COD IDs and SMILES strings, so you will<br>need to get the chemical formulas and names from another source (e.g. the SQL database<br>or directly from the CIF files). 
The same file can also be downloaded from the Subversion repository<br>at:<br><br>svn://<a href="http://www.crystallography.net/cod/smi/allcod.smi">www.crystallography.net/cod/smi/allcod.smi</a></div><div><br></div><div>The SMILES file contains the same SMILES strings as the SQL database and currently covers about<br>half of the COD (the 250k you previously mentioned). The file is only about 25 MiB in size, so<br>it should fit comfortably on your machine.<br><br></div><div>Finally, if you plan to work with the COD SMILES, it might be useful to familiarise yourself<br>with the SMILES conventions that we use:<br><br><a href="https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0279-6">https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0279-6</a></div><div><br></div><div>Please let us know if you have any further questions.<br><br></div><div>Sincerely,<br></div><div>Antanas</div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Fri, 1 Aug 2025 at 17:39, Quinny Campbell &lt;<a href="mailto:quinnycamp@meta.com">quinnycamp@meta.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="msg-5986656569890159509">
<div dir="ltr">
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Hello,</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I am Quinny, a PhD student, and I'm working on developing AI tools to support crystallization work.</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I'd like a simple list of all molecules in the COD, along with their SMILES strings where they exist (I see that only a bit fewer than 250k are available so far). Is there an easy way to do this? I don't need any of the CIF files, just the identifier, name, SMILES, and molecular
formula. Preferably no duplicates, but I can deduplicate if needed. </div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
I tried to obtain the COD by downloading it via Subversion, but my disk filled up, as the download had reached 158 GiB so far. I quickly realized that downloading the entire COD isn't the best solution. There also seems to be no way to run many queries quickly via the web. What options do
I have? </div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
<br>
</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Thanks!</div>
<div style="font-family:Aptos,Aptos_EmbeddedFont,Aptos_MSFontService,Calibri,Helvetica,sans-serif;font-size:12pt;color:rgb(0,0,0)">
Quinny</div>
<br>--
<br>This message has been scanned for viruses and
<br>dangerous content by
<a href="http://www.mailscanner.info/" target="_blank"><b>MailScanner</b></a>, and is
<br>believed to be clean.
</div>
_______________________________________________<br>
Cod-bugs mailing list<br>
<a href="mailto:Cod-bugs@lists.crystallography.net" target="_blank">Cod-bugs@lists.crystallography.net</a><br>
<a href="http://lists.crystallography.net/cgi-bin/mailman/listinfo/cod-bugs" rel="noreferrer" target="_blank">http://lists.crystallography.net/cgi-bin/mailman/listinfo/cod-bugs</a><br>
</div></blockquote></div><div><br clear="all"></div><br><span class="gmail_signature_prefix">-- </span><br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><div>Antanas Vaitkus,<br></div>Vilnius University,<br>Life Sciences Center,<br>Institute of Biotechnology,<br><span><span><span>room C521, </span></span></span>Saulėtekio al. 7,<br>LT-10257 Vilnius, Lithuania<br></div><div><div><div><div><div dir="ltr"><div><div dir="ltr"><div><div dir="ltr"><br><br></div></div></div></div></div></div></div></div></div></div></div>