[Cod-bugs] Number of entries in smiles.txt do not match cif entries.

Antanas Vaitkus antanas.vaitkus90 at gmail.com
Tue Nov 15 16:12:38 EET 2022


Dear Vladas,

On Tue, 15 Nov 2022 at 15:32, Vladas Oleinikovas <
voleinikovas at monterosatx.com> wrote:

> Hi!
>
> Firstly, thanks for an amazing repo and great documentation!
>

It is good to hear that you find the COD useful.

> I have recently downloaded COD using command:
> >wget http://www.crystallography.net/archives/cod-cifs-mysql.zip
> After unzipping I found cif and mysql directories – as expected.
>
> Looking at files in mysql entries I caught interest of smiles.txt file.
> This looks very useful for searching the molecules of interest, especially
> the organic ones, that I am interested. I assume this relates to this paper
> (https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0279-6),
> is that correct?
>

Yes, the paper describes the overall workflow used to create the SMILES
strings as well as the conventions employed to represent various compounds
which do not fit well in the bond valence model that the SMILES format is
based on.


> Counting entries in this file, however, I find the number of entries
> significantly smaller than the reported number of entries on the title page
> (“Currently there are 494800 entries in the COD”):
> ~/COD/mysql:> wc -l smiles.txt
>
> > 219646 smiles.txt
>
> Is this because the file is not being updated, or does that exclude
> entries that were unable to be converted into SMILES?
>

Since the COD SMILES strings are generated semi-manually by one of our
volunteer chemists (for more details see the paper you cited earlier), the
overall process is quite slow. The SMILES dataset is still routinely
updated and hopefully will eventually cover a more significant part of the
COD.
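
To put the two numbers above side by side, here is a minimal sketch (not
part of the COD tooling) that computes the current SMILES coverage. It
assumes that smiles.txt holds one record per line, which is also what
`wc -l` implicitly assumes; the function names `count_entries` and
`coverage` are made up for this illustration.

```python
def count_entries(path):
    """Count non-empty lines in a file, assuming one record per line."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip())


def coverage(n_smiles, n_total):
    """Return the fraction of COD entries that have a SMILES record."""
    return n_smiles / n_total


if __name__ == "__main__":
    # Numbers quoted in this thread: the `wc -l smiles.txt` result and
    # the entry count shown on the COD title page at the time.
    n_smiles = 219646
    n_total = 494800
    print(f"SMILES coverage: {coverage(n_smiles, n_total):.1%}")  # 44.4%
```

With a local copy of the dump, `count_entries("smiles.txt")` could replace
the hard-coded line count, so the figure can be recomputed after each
update of the archive.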

We are also working on a more automated approach for deriving chemical
descriptions from crystallographic data (CIF -> SMILES, SDF, DWAR, etc.)
which will provide an alternative way of searching for chemical compounds
in the COD. The manuscript is still in preparation, but I can send you a
link to the paper once it is published, if you are interested.


> Many thanks for your reply!
>

Hopefully this answers your question. Please let me know if you have any
further questions or comments.

>
> Best wishes,
> Vladas
>
> P.S. Feel free to answer in Lithuanian, if preferred 😊
>

I do prefer Lithuanian, but decided to reply in English in case I need to
answer the same question for non-Lithuanian speakers in the future.

Sincerely,
Antanas Vaitkus



>
> --
> This message has been scanned for viruses and
> dangerous content by *MailScanner* <http://www.mailscanner.info/>, and is
> believed to be clean.
> _______________________________________________
> Cod-bugs mailing list
> Cod-bugs at lists.crystallography.net
> http://lists.crystallography.net/cgi-bin/mailman/listinfo/cod-bugs
>


-- 
Antanas Vaitkus,
Vilnius University,
Life Sciences Center,
Institute of Biotechnology,
room C521, Saulėtekio al. 7,
LT-10257 Vilnius, Lithuania

