Substructure and similarity searching a large dataset

Open Babel provides the fs (FastSearch index) format, which should be used when searching large datasets (such as ChEMBL) for molecules similar to a particular query. There are faster ways of searching (such as using a chemical database), but FastSearch is convenient and should give reasonable performance for most users.

To demonstrate similarity searching, we will use the first 1000 molecules in the latest release of ChEMBL:

  • Download the 2D SDF version of ChEMBL, chembl_nn.sdf.gz, from the ChEMBLdb download site and save it in your Work folder. (Note: this is a gzipped file, but Open Babel will handle this without problems.)

  • Set up an SDF to SDF conversion, with chembl_nn.sdf.gz as the input file and 1000_chembl.sdf as the output file.

  • Only convert the first 1000 molecules by entering 1000 in the box End import at molecule # specified.

../../_images/first1000.png
  • Click CONVERT
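
For readers who prefer the command line, the conversion above can be sketched with the obabel program that ships with Open Babel. This is a minimal sketch, not the tutorial's own method: it assumes obabel is on the PATH, that the -l option ends the import at the given molecule number, and that the input and output formats are inferred from the file extensions. The helper name first_n_command is hypothetical, and chembl_nn.sdf.gz stands for whichever release was downloaded.

```python
import os
import shlex
import shutil
import subprocess


def first_n_command(src, dst, n):
    """Build an obabel command that converts only the first n molecules."""
    # -O names the output file; -l N ends the import at molecule number N.
    return ["obabel", src, "-O", dst, "-l", str(n)]


cmd = first_n_command("chembl_nn.sdf.gz", "1000_chembl.sdf", 1000)
print(shlex.join(cmd))  # the equivalent shell command

# Only run the conversion when obabel is installed and the download is present.
if shutil.which("obabel") and os.path.exists("chembl_nn.sdf.gz"):
    subprocess.run(cmd, check=True)
```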

We are going to use the following structure for substructure and similarity searching. It can be represented by the SMILES string Nc1ccc(N)cc1.

../../_images/1%2C4-diamino-phenyl.png

Next, we will create a FastSearch index for this dataset of 1000 molecules:

  • Convert 1000_chembl.sdf from SDF to FS format, with an output filename of 1000_chembl.fs
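
Building the index can also be sketched as an obabel call. The sketch below assumes obabel is on the PATH and that the .fs extension on the output file is enough to select the FastSearch format; index_command is a hypothetical helper name.

```python
import os
import shlex
import shutil
import subprocess


def index_command(src, index):
    """Build an obabel command that writes a FastSearch index for src."""
    # The output format (fs) is assumed to be picked up from the .fs extension.
    return ["obabel", src, "-O", index]


cmd = index_command("1000_chembl.sdf", "1000_chembl.fs")
print(shlex.join(cmd))  # the equivalent shell command

# Only build the index when obabel is installed and the SDF file exists.
if shutil.which("obabel") and os.path.exists("1000_chembl.sdf"):
    subprocess.run(cmd, check=True)
```

As far as we understand, the index holds fingerprints and positions in the original file rather than the molecules themselves, so 1000_chembl.sdf should be kept alongside 1000_chembl.fs.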

By using this FastSearch index, the speed of substructure and similarity searching is much improved. First of all, let’s do a substructure search:

  • Set up a conversion from FS to SMILES with 1000_chembl.fs as the input file. Tick the boxes for Output below only and Display in Firefox.

  • Enter Nc1ccc(N)cc1 into the box Convert only if match SMARTS or mol in file

  • Click CONVERT

  1. How does the speed of the substructure search compare with using 1000_chembl.sdf as the input file instead?
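
One way to explore the question above is to time the same search against both input files. The following is a rough sketch rather than a benchmark: it assumes obabel is on the PATH, that -s filters on the query for both the index and the plain SDF file, and that -osmi writes SMILES to standard output. The helper names substructure_command and timed_run are hypothetical.

```python
import os
import shlex
import shutil
import subprocess
import time

QUERY = "Nc1ccc(N)cc1"


def substructure_command(src, query=QUERY):
    """Build an obabel command that prints the molecules in src matching query."""
    # -osmi selects SMILES output (to standard output, as no -O is given);
    # -s keeps only molecules that match the query.
    return ["obabel", src, "-osmi", "-s", query]


def timed_run(cmd):
    """Run cmd and return (elapsed seconds, number of lines written to stdout)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    elapsed = time.perf_counter() - start
    return elapsed, len(result.stdout.splitlines())


for src in ("1000_chembl.fs", "1000_chembl.sdf"):
    cmd = substructure_command(src)
    print(shlex.join(cmd))  # the equivalent shell command
    # Only time the search when obabel is installed and the input file exists.
    if shutil.which("obabel") and os.path.exists(src):
        seconds, hits = timed_run(cmd)
        print(f"{src}: {hits} hits in {seconds:.3f} s")
```

With only 1000 molecules the difference may be small and will vary between runs, so repeating the measurement a few times gives a fairer picture.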

Next, let’s find the 5 most similar molecules to the same query. Similarity is measured by the Tanimoto coefficient between the path-based fingerprints of the query and of each molecule. This has a value from 0.0 to 1.0 (maximum similarity), and we will display the value below each molecule:

  • Set up the FS to SMILES conversion as before, and again enter Nc1ccc(N)cc1 into the box Convert only if match SMARTS or mol in file

  • Enter 5 into the box Do similarity search: #mols or # as min Tanimoto

  • Tick the box Add Tanimoto coefficient to title in similarity search

  • Click CONVERT
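
The similarity search set up in the steps above can be sketched as an obabel call as well. The sketch assumes obabel is on the PATH, that -at followed by a whole number requests that many most-similar molecules, and that -aa adds the Tanimoto coefficient to each title; similarity_command is a hypothetical helper name.

```python
import os
import shlex
import shutil
import subprocess


def similarity_command(index, query, top_n):
    """Build an obabel command that prints the top_n molecules most similar to query."""
    # -s gives the query molecule; -at<N> asks for the N most similar molecules;
    # -aa appends the Tanimoto coefficient to each molecule's title.
    return ["obabel", index, "-osmi", "-s", query, f"-at{top_n}", "-aa"]


cmd = similarity_command("1000_chembl.fs", "Nc1ccc(N)cc1", 5)
print(shlex.join(cmd))  # the equivalent shell command

# Only run the search when obabel is installed and the index exists.
if shutil.which("obabel") and os.path.exists("1000_chembl.fs"):
    subprocess.run(cmd, check=True)
```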

../../_images/similarity.png
  1. Look at the 5 most similar molecules. Can you tell why they were regarded as similar to the query?
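
To make the similarity values easier to interpret, here is a minimal sketch of the Tanimoto coefficient itself, computed on two fingerprints represented as sets of "on" bit positions. The bit positions are invented for illustration and are not real Open Babel fingerprints; returning 0.0 when both fingerprints are empty is a convention chosen for this sketch.

```python
def tanimoto(bits_a, bits_b):
    """Return |A and B| / |A or B| for two collections of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Both fingerprints are empty; 0.0 is this sketch's convention.
        return 0.0
    return len(a & b) / union


# Made-up fingerprints: three bits in common, six distinct bits overall.
query_bits = {1, 4, 7, 9}
candidate_bits = {1, 4, 9, 12, 15}
print(tanimoto(query_bits, candidate_bits))  # 3 / 6 = 0.5
```

Two molecules that set exactly the same bits score 1.0, and molecules with no bits in common score 0.0, which matches the range described above.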