[Cod-bugs] Question about search by SMILES

Andrius Merkys andrius.merkys at gmail.com
Tue Jun 18 11:51:33 EEST 2019


Dear Jiuyang,

For SMILES/SMARTS searches we use the Open Babel package [1]. The
current implementation in the COD performs a substructure search: the
database is queried for all structures that contain the requested
fragment (in your case the benzene ring), not only those that match it
exactly. The returned entries are not sorted by relevance, but simply by
their IDs in the database. Sorting by relevance could be implemented in
principle, but it depends a lot on how relevance is defined. I would
argue that Levenshtein distance isn't general enough to apply in all
cases. Other measures, for example the Tanimoto index of fingerprint
similarity, might be investigated, but that is currently out of scope
for the COD.
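
As a rough illustration of the two measures mentioned above (a sketch
only, not what the COD or Open Babel does internally): benzene can be
written both as aromatic 'c1ccccc1' and as Kekule 'C1=CC=CC=C1', so an
edit distance on the strings is nonzero even though the molecule is the
same. The Tanimoto index is shown on plain Python sets; the two
"fingerprints" below are made-up sets of feature numbers, whereas real
ones would come from a cheminformatics toolkit.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto index: shared features divided by all features."""
    union = fp_a | fp_b
    if not union:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(fp_a & fp_b) / len(union)


# Two valid SMILES spellings of the same molecule (benzene).
aromatic = "c1ccccc1"
kekule = "C1=CC=CC=C1"
same_molecule_distance = levenshtein(aromatic, kekule)

# Hypothetical fingerprints: sets of feature numbers, for illustration.
fp_query = {1, 2, 3}
fp_entry = {2, 3, 4}
similarity = tanimoto(fp_query, fp_entry)

print(same_molecule_distance > 0, similarity)
```

The string distance depends on how the SMILES happens to be written,
while a fingerprint-based index compares structural features, which is
why the latter is the more natural candidate for ranking.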

Should you find COD's search capabilities too basic, you can download
the whole COD SMILES database [2] for local examination.
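
A minimal sketch of such a local examination follows. It assumes that
each line of allcod.smi holds a SMILES string followed by whitespace and
a COD ID; please check the downloaded file before relying on that. The
function names and the sample IDs are made up for illustration. Note
that plain string equality only finds entries written exactly like the
query, so other spellings of the same molecule would be missed.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_smi(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (smiles, cod_id) pairs; assumes 'SMILES<whitespace>ID' lines."""
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or incomplete lines
        yield parts[0], parts[1]


def exact_matches(records: Iterable[Tuple[str, str]], query: str) -> List[str]:
    """Return IDs whose SMILES string is identical to the query string."""
    return [cod_id for smiles, cod_id in records if smiles == query]


# Made-up sample lines standing in for the real file contents.
sample = [
    "c1ccccc1\t0000001",
    "O[C@@H](C)[C@@H]([C@@H](C)c1ccccc1)c1ccccc1\t0000002",
    "",
]
hits = exact_matches(parse_smi(sample), "c1ccccc1")
print(hits)
```

With the real file, the sample list would be replaced by an open file
handle, for example `with open("allcod.smi") as fh:` followed by
`exact_matches(parse_smi(fh), "c1ccccc1")`.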

Best wishes,
Andrius

[1] http://openbabel.org
[2] http://crystallography.net/cod/smi/allcod.smi

On 2019-06-17 22:21, J. Zhao wrote:
>
> But I've noticed that if you search for 'c1ccccc1', actually none of
> the structures on the first few pages is really benzene (c1ccccc1)
> itself. But they all contain the string 'c1ccccc1' in their SMILES.
>
> For example, 'O[C@@H](C)[C@@H]([C@@H](C)c1ccccc1)c1ccccc1', which is
> the SMILES for '3,4-Diphenylpentan-2-ol' appears as the 13th entry in
> the search result.
>
> To be honest, other databases' search results seem to have this
> problem as well. Although I have no idea how COD ranks its search
> results, I would suggest ranking them by some distance measure between
> each result's SMILES and the target SMILES if possible, like
> Levenshtein distance maybe? Then the entry with SMILES 'c1ccccc1' will
> rank first since the Levenshtein distance is 0 and it is indeed the
> correct structure for query 'c1ccccc1'.
>
-- 
Andrius Merkys
Vilnius University Institute of Biotechnology, Saulėtekio al. 7, room V325
LT-10257 Vilnius, Lithuania



