[Cod-bugs] Number of entries in smiles.txt do not match cif entries.

Vladas Oleinikovas voleinikovas at monterosatx.com
Tue Nov 15 16:16:05 EET 2022


Hello,

Thank you for the detailed reply. I look forward to news about the new publication 😊

Talk soon,
Vladas

From: Antanas Vaitkus <antanas.vaitkus90 at gmail.com>
Date: Tuesday, 15 November 2022 at 15:12
To: Vladas Oleinikovas <voleinikovas at monterosatx.com>
Cc: cod-bugs at ibt.lt <cod-bugs at ibt.lt>
Subject: Re: [Cod-bugs] Number of entries in smiles.txt do not match cif entries.
Dear Vladas,
On Tue, 15 Nov 2022 at 15:32, Vladas Oleinikovas <voleinikovas at monterosatx.com> wrote:
Hi!

Firstly, thanks for an amazing repo and great documentation!

It is good to hear that you find the COD useful.

I have recently downloaded COD using command:
>wget http://www.crystallography.net/archives/cod-cifs-mysql.zip
After unzipping I found cif and mysql directories – as expected.

Looking at the files in the mysql directory, the smiles.txt file caught my interest. It looks very useful for searching for molecules of interest, especially the organic ones that I am interested in. I assume this relates to this paper (https://jcheminf.biomedcentral.com/articles/10.1186/s13321-018-0279-6), is that correct?

Yes, the paper describes the overall workflow used to create the SMILES strings as well as the conventions employed to represent various compounds which do not fit well in the bond valence model that the SMILES format is based on.

Counting entries in this file, however, I find the number of entries significantly smaller than the reported number of entries on the title page (“Currently there are 494800 entries in the COD”):
~/COD/mysql:> wc -l smiles.txt
> 219646 smiles.txt

Is this because the file is not being updated, or does it exclude entries that could not be converted into SMILES?

Since the COD SMILES strings are generated semi-manually by one of our volunteer chemists (for more details see the paper you cited earlier), the overall process is quite slow. The SMILES dataset is still routinely updated and hopefully will eventually cover a more significant part of the COD.
We are also working on a more automated approach for deriving chemical descriptions from crystallographic data (CIF -> SMILES, SDF, DWAR, etc.) which will provide an alternative way of searching for chemical compounds in the COD. The manuscript is still in preparation, but I can send you a link to the paper once it is published if you are interested.
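
For anyone who wants to check the SMILES coverage of a local copy, here is a minimal Python sketch. It assumes that smiles.txt holds one entry per non-empty line (the same assumption the `wc -l` count above makes) and that the unzipped archive keeps its structure files as `*.cif` somewhere below the cif directory. The function names are made up for illustration and are not part of any COD tooling.

```python
from pathlib import Path


def count_smiles_entries(smiles_path):
    """Count non-empty lines, assuming one COD entry per line."""
    with open(smiles_path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip())


def count_cif_files(cif_root):
    """Count *.cif files anywhere below cif_root (assumed layout)."""
    return sum(1 for _ in Path(cif_root).rglob("*.cif"))


def coverage(smiles_path, cif_root):
    """Return (SMILES entries, CIF files, fraction of CIFs with SMILES)."""
    n_smiles = count_smiles_entries(smiles_path)
    n_cifs = count_cif_files(cif_root)
    # Guard against an empty or missing CIF tree.
    fraction = n_smiles / n_cifs if n_cifs else 0.0
    return n_smiles, n_cifs, fraction
```

Calling `coverage("mysql/smiles.txt", "cif")` from the directory where the archive was unzipped would give the local numbers. With the figures quoted in this thread (219646 lines in smiles.txt against 494800 entries reported on the title page), the fraction comes out at roughly 44%.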

Many thanks for your reply!

Hopefully this answers your question. Please let me know if you have any further questions or comments.

Best wishes,
Vladas

P.S. Feel free to answer in Lithuanian, if preferred 😊

I do prefer Lithuanian, but decided to reply in English in case I need to answer the same question for non-Lithuanian speakers in the future.
Sincerely,
Antanas Vaitkus


--
This message has been scanned for viruses and
dangerous content by MailScanner <http://www.mailscanner.info/>, and is
believed to be clean.
_______________________________________________
Cod-bugs mailing list
Cod-bugs at lists.crystallography.net
http://lists.crystallography.net/cgi-bin/mailman/listinfo/cod-bugs


--
Antanas Vaitkus,
Vilnius University,
Life Sciences Center,
Institute of Biotechnology,
room C521, Saulėtekio al. 7,
LT-10257 Vilnius, Lithuania

