[Cod-bugs] A quick list of crystals with SMILES

Antanas Vaitkus antanas.vaitkus90 at gmail.com
Tue Aug 5 16:18:06 EEST 2025


Hello Quinny,

The chemical information in the COD is currently a bit fragmented and
incomplete, but you should still be able to produce a sizable dataset that
meets your quality criteria by querying the COD SQL database.

I provide an example query which you can further modify based on your needs:
```
mysql -u cod_reader -h sql.crystallography.net cod -e '
  select A.`cod_id`, A.`value`, B.`chemname`, B.`commonname`,
         B.`formula`, B.`calcformula`
  from `smiles` as A
  join `data` as B on (A.`cod_id` = B.`file`)
  limit 10'
```
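
If you would rather post-process the result in a script, the same query can
be run in batch mode and parsed. The following is only a minimal Python
sketch, assuming the mysql client is installed locally. The helper names
(build_command, parse_batch_output, run_query) are hypothetical. With
--batch the client prints tab-separated rows with a header line and writes
SQL NULL as the literal string NULL; the sketch does not undo the backslash
escapes that the client applies to special characters in this mode.

```python
import subprocess

# Same query as in the command above.
QUERY = (
    "select A.`cod_id`, A.`value`, B.`chemname`, B.`commonname`, "
    "B.`formula`, B.`calcformula` "
    "from `smiles` as A join `data` as B on (A.`cod_id` = B.`file`) "
    "limit 10"
)


def build_command(query=QUERY):
    """Return the argument list for the mysql call, with batch output enabled."""
    return [
        "mysql", "--batch",
        "-u", "cod_reader",
        "-h", "sql.crystallography.net",
        "cod", "-e", query,
    ]


def parse_batch_output(text):
    """Turn tab-separated batch output (header line first) into a list of dicts."""
    lines = [line for line in text.split("\n") if line]
    if not lines:
        return []
    header = lines[0].split("\t")
    rows = []
    for line in lines[1:]:
        fields = line.split("\t")
        # The client prints SQL NULL as the literal string NULL.
        rows.append({k: (None if v == "NULL" else v) for k, v in zip(header, fields)})
    return rows


def run_query(query=QUERY):
    """Run the query against the COD server (requires network access)."""
    result = subprocess.run(
        build_command(query), capture_output=True, text=True, check=True
    )
    return parse_batch_output(result.stdout)
```

Calling run_query() should give one dict per row, keyed by column name
(cod_id, value, chemname, and so on). Removing the "limit 10" clause would
return the full joined table.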

Running this command-line query will return the COD ID, the SMILES string,
the chemical names, and the declared and calculated chemical formulas. A more
detailed description of how each database field should be interpreted (e.g.
the difference between chemname and commonname) is provided in the following
XML file:

https://www.crystallography.net/cod/xml/documents/database-description/database-description.xml

Note that COD structures may contain incorrect chemical names or no chemical
names at all, since this information is currently taken from the original CIF
files without further validation.
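
One simple way to cope with missing names is to fall back from chemname to
commonname and flag entries that have neither. The sketch below assumes each
row is a dict keyed by the column names of the query above; pick_name is a
hypothetical helper. Treating "?" and "." as missing is also an assumption,
based on the CIF placeholders for unknown and inapplicable values. It cannot
detect names that are present but wrong.

```python
def pick_name(row):
    """Return the first usable name among chemname and commonname, else None."""
    for key in ("chemname", "commonname"):
        value = row.get(key)
        # Assumption: empty strings and the CIF placeholders "?" and "."
        # count as "no name given".
        if value is not None and value.strip() not in ("", "?", "."):
            return value.strip()
    return None
```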

Alternatively, you can download a single file with all of the COD SMILES
strings from:

https://www.crystallography.net/cod/smi/allcod.smi

However, note that the file contains only the COD IDs and SMILES strings, so
you will need to get the chemical formulas and names from some other source
(e.g. the SQL database or directly from the CIF files). The same file can also
be downloaded from the Subversion repository under:

svn://www.crystallography.net/cod/smi/allcod.smi

The SMILES file contains the same SMILES as the SQL database and currently
covers about half of the COD (the 250k you previously mentioned). The file is
only about 25 MiB in size, so it should fit comfortably on your machine.
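
As a rough illustration of reading the downloaded file, here is a minimal
Python sketch. The column order and separator are assumptions: the code
takes the first two whitespace-separated fields of each line and treats
whichever one consists of exactly seven digits as the COD ID (COD IDs are
seven-digit numbers). The names load_smiles and group_by_smiles are
hypothetical. The grouping compares SMILES as plain strings, so it only finds
duplicates that are written identically.

```python
import re
from collections import defaultdict

COD_ID = re.compile(r"^\d{7}$")


def load_smiles(lines):
    """Map COD ID -> SMILES from an iterable of text lines."""
    smiles_by_id = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        first, second = fields[0], fields[1]
        # Assumption: one of the first two fields is the seven-digit COD ID.
        if COD_ID.match(second):
            smiles_by_id[second] = first
        elif COD_ID.match(first):
            smiles_by_id[first] = second
    return smiles_by_id


def group_by_smiles(smiles_by_id):
    """Group COD IDs that share an identical SMILES string."""
    groups = defaultdict(list)
    for cod_id, smiles in sorted(smiles_by_id.items()):
        groups[smiles].append(cod_id)
    return dict(groups)


# Example usage, assuming the file was saved locally as allcod.smi:
# with open("allcod.smi", encoding="utf-8") as handle:
#     smiles_by_id = load_smiles(handle)
# duplicates = {s: ids for s, ids in group_by_smiles(smiles_by_id).items()
#               if len(ids) > 1}
```

The resulting dictionary can then be matched against names and formulas
obtained from the SQL database using the COD ID as the key.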

Finally, if you plan to work with the COD SMILES, it might be useful to
familiarise yourself with the SMILES conventions that we use:

https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0279-6

Please let us know if you have any further questions.

Sincerely,
Antanas

On Fri, 1 Aug 2025 at 17:39, Quinny Campbell <quinnycamp at meta.com> wrote:

> Hello,
>
> I am Quinny, PhD student, and I'm working on developing AI tools to
> support crystallization works.
>
> I'd love to access a quick list of all molecules in COD. Also, I'd like to
> get SMILES if it exists (I see that it's only a bit less than 250k so far).
> Is there an easy way to do this? I don't need ANY of cif files — just
> identifier, name, SMILES, and molecular formula. Preferably no duplicates,
> but I can deduplicate if needed.
>
> I tried to obtain COD by downloading it via subversion. My disk space
> maximized out, as it is 158 GiB so far. I quickly realized that downloading
> the entire COD isn't the best solution. There's no way to do multiple
> queries quickly via web. What options do I have?
>
> Thanks!
> Quinny
>
> --
> This message has been scanned for viruses and
> dangerous content by *MailScanner* <http://www.mailscanner.info/>, and is
> believed to be clean.
> _______________________________________________
> Cod-bugs mailing list
> Cod-bugs at lists.crystallography.net
> http://lists.crystallography.net/cgi-bin/mailman/listinfo/cod-bugs
>


-- 
Antanas Vaitkus,
Vilnius University,
Life Sciences Center,
Institute of Biotechnology,
room C521, Saulėtekio al. 7,
LT-10257 Vilnius, Lithuania
