Difference between revisions of "Tutorial:Fingerprints"

From Open Babel
(Major rearrangement of the text)
Revision as of 10:18, 30 January 2010

Molecular fingerprints encode molecular structure in a series of binary digits (bits) that represent the presence or absence of particular substructures in the molecule. Comparing fingerprints will allow you to determine the similarity between two molecules, search databases, etc., but does not include full structural data (such as coordinates).

Available fingerprints

You can see the available fingerprints by typing the following command:

PROMPT> babel -L fingerprints
FP2    Indexes linear fragments up to 7 atoms.
FP3    SMARTS patterns specified in the file patterns.txt
FP4    SMARTS patterns specified in the file SMARTS_InteLigand.txt
MACCS    SMARTS patterns specified in the file MACCS.txt

At present there are four types of fingerprints: FP2, a path-based fingerprint that indexes small molecule fragments (somewhat similar to the Daylight fingerprints); FP3 and FP4, which both use a series of SMARTS queries, stored in patterns.txt and SMARTS_InteLigand.txt respectively; and MACCS, which uses the SMARTS patterns in MACCS.txt.

Note that you can tailor these fingerprints to your own needs by adding your own SMARTS queries to these files. On UNIX and Mac systems, these files are frequently found in /usr/local/share/openbabel under a directory for each version of Open Babel.
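To make the idea of a path-based fingerprint more concrete, here is a toy sketch in Python. It is not Open Babel's FP2 algorithm: the function names, the 64-bit size and the use of CRC32 as the hash are assumptions chosen for illustration, and bond orders and ring closures are ignored. It only shows the general principle of enumerating linear fragments of up to 7 atoms and hashing each one to a bit position.

```python
import zlib

def linear_paths(atoms, bonds, max_len=7):
    """Yield every simple path of 1..max_len atoms as a tuple of atom indices."""
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    def extend(path):
        yield tuple(path)
        if len(path) < max_len:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:  # never revisit an atom within one path
                    yield from extend(path + [nxt])

    for start in range(len(atoms)):
        yield from extend([start])

def path_fingerprint(atoms, bonds, n_bits=64, max_len=7):
    """Hash each linear fragment to one bit of an integer fingerprint (toy scheme)."""
    fp = 0
    for path in linear_paths(atoms, bonds, max_len):
        symbols = [atoms[i] for i in path]
        # A fragment and its reverse are the same fragment, so use one canonical key.
        key = min("-".join(symbols), "-".join(reversed(symbols)))
        fp |= 1 << (zlib.crc32(key.encode()) % n_bits)
    return fp

# Heavy atoms only: ethane is C-C, ethanol is C-C-O.
ethane = path_fingerprint(["C", "C"], [(0, 1)])
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
```

Because every fragment of ethane also occurs in ethanol, every bit set in the ethane fingerprint is also set in the ethanol fingerprint. This property is what makes such fingerprints useful for both similarity and substructure screening.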

Similarity searching

For relatively small datasets (fewer than tens of thousands of molecules), similarity searches can be run directly without building a similarity index. Larger datasets (up to hundreds of thousands of molecules) can be searched rapidly once a fastsearch index has been built.

Small datasets

On small datasets these fingerprints can be used in a variety of ways. The following command gives you the Tanimoto coefficient between a SMILES string in mysmiles.smi and all the molecules in mymols.sdf:

PROMPT>  babel  'mysmiles.smi'  'mymols.sdf' -ofpt
MOL_00000067   Tanimoto from first mol = 0.0888889
MOL_00000083   Tanimoto from first mol = 0.0869565
MOL_00000105   Tanimoto from first mol = 0.0888889
MOL_00000296   Tanimoto from first mol = 0.0714286
MOL_00000320   Tanimoto from first mol = 0.0888889
MOL_00000328   Tanimoto from first mol = 0.0851064
MOL_00000338   Tanimoto from first mol = 0.0869565
MOL_00000354   Tanimoto from first mol = 0.0888889
MOL_00000378   Tanimoto from first mol = 0.0816327
MOL_00000391   Tanimoto from first mol = 0.0816327
11 molecules converted
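For bit fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. The following Python sketch shows the calculation on two small made-up bit patterns. Storing a fingerprint as a Python integer and returning 0.0 for two empty fingerprints are assumptions made for this illustration, not a description of babel's internals.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two fingerprints stored as integers (one bit per feature)."""
    common = bin(fp_a & fp_b).count("1")  # bits set in both fingerprints
    either = bin(fp_a | fp_b).count("1")  # bits set in at least one fingerprint
    return common / either if either else 0.0  # assumed convention for empty fingerprints

query = 0b10110010   # 4 bits set
target = 0b10010110  # 4 bits set, 3 of them shared with the query
print(tanimoto(query, target))  # prints 0.6 (3 shared bits / 5 bits set in either)
```

A value of 1 means the two fingerprints are identical, and a value near 0 means they share almost no bits.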

The default fingerprint is FP2. You can select a different fingerprint with the "f" output option, as follows:

PROMPT>  babel 'mymols.sdf' -ofpt -xfFP3

The "-s" option of babel filters molecules by SMARTS string (see Babel). Note that the filter also applies to the query molecule: if the query does not match the SMARTS string, the first molecule in the database that does match will be used as the query instead, and the results will not be what you expect. To find the similarity only to the substituted bromobenzenes in mymols.sdf, you might combine the options like this:

PROMPT>  babel 'mysmiles.smi' 'mymols.sdf' -ofpt -s 'c1ccccc1Br'
MOL_00000067   Tanimoto from first mol = 0.0888889
MOL_00000083   Tanimoto from first mol = 0.0869565
MOL_00000105   Tanimoto from first mol = 0.0888889

If you don't specify a query file, babel will just use the first molecule in the database as the query:

PROMPT>  babel  'mymols.sdf'  -ofpt
MOL_00000067
MOL_00000083   Tanimoto from MOL_00000067 = 0.810811
MOL_00000105   Tanimoto from MOL_00000067 = 0.833333
MOL_00000296   Tanimoto from MOL_00000067 = 0.425926
MOL_00000320   Tanimoto from MOL_00000067 = 0.534884
MOL_00000328   Tanimoto from MOL_00000067 = 0.511111
MOL_00000338   Tanimoto from MOL_00000067 = 0.522727
MOL_00000354   Tanimoto from MOL_00000067 = 0.534884
MOL_00000378   Tanimoto from MOL_00000067 = 0.489362
MOL_00000391   Tanimoto from MOL_00000067 = 0.489362
10 molecules converted 
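If you want to post-process this output, for example to keep only the molecules above a similarity cutoff, a few lines of Python are enough. The sketch below assumes the output looks exactly like the listing above (a molecule title with no spaces, followed by "Tanimoto from ... = value"). The function name parse_fpt and the 0.6 cutoff are hypothetical.

```python
import re

# Matches lines such as "MOL_00000083   Tanimoto from MOL_00000067 = 0.810811".
LINE = re.compile(r"^(\S+)\s+Tanimoto from (.+?) = ([0-9.]+)\s*$")

def parse_fpt(text):
    """Return (name, tanimoto) pairs; lines without a Tanimoto value are skipped."""
    results = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            results.append((match.group(1), float(match.group(3))))
    return results

sample = """MOL_00000067
MOL_00000083   Tanimoto from MOL_00000067 = 0.810811
MOL_00000105   Tanimoto from MOL_00000067 = 0.833333
MOL_00000296   Tanimoto from MOL_00000067 = 0.425926
10 molecules converted"""

hits = [(name, score) for name, score in parse_fpt(sample) if score > 0.6]
print(hits)  # [('MOL_00000083', 0.810811), ('MOL_00000105', 0.833333)]
```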

Large datasets

On larger datasets it is necessary to first build a fastsearch index. This is a new file that stores a database of fingerprints for the files indexed. You will still need to keep both the new .fs fastsearch index and the original files. However, the new index will allow significantly faster searching and similarity comparisons. The index is created with the following command:

PROMPT>  babel mymols.sdf -ofs

This builds mymols.fs with the default fingerprint (unfolded). The following command uses the index to find the 5 most similar molecules to the molecule in query.mol:

PROMPT>  babel mymols.fs results.sdf -Squery.mol -at5

Or, to get the matches with a Tanimoto coefficient greater than 0.6 to 1,2-dicyanobenzene:

PROMPT>  babel mymols.fs results.sdf -sN#Cc1ccccc1C#N -at0.6
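The two commands above correspond to two styles of similarity query: "the N most similar molecules" and "all molecules above a cutoff". The Python sketch below shows both on a small made-up library of integer fingerprints. It illustrates the idea only and does not read .fs files; the function names and the data are hypothetical.

```python
import heapq

def tanimoto(fp_a, fp_b):
    """Shared bits divided by bits set in either fingerprint."""
    either = bin(fp_a | fp_b).count("1")
    return bin(fp_a & fp_b).count("1") / either if either else 0.0

def most_similar(query, library, n):
    """Return the n (name, score) pairs with the highest Tanimoto coefficient."""
    scored = ((tanimoto(query, fp), name) for name, fp in library.items())
    return [(name, score) for score, name in heapq.nlargest(n, scored)]

def above_threshold(query, library, cutoff):
    """Return the names of all molecules whose Tanimoto coefficient exceeds cutoff."""
    return sorted(name for name, fp in library.items() if tanimoto(query, fp) > cutoff)

library = {"mol_a": 0b1111, "mol_b": 0b1100, "mol_c": 0b0011, "mol_d": 0b1000}
query = 0b1110

print([name for name, _ in most_similar(query, library, 2)])  # ['mol_a', 'mol_b']
print(above_threshold(query, library, 0.6))                   # ['mol_a', 'mol_b']
```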

Substructure searching

Small datasets

This command will find all molecules containing 1,2-dicyanobenzene and return the results as SMILES strings:

PROMPT>  babel mymols.sdf -sN#Cc1ccccc1C#N results.smi

If you only want the molecule names, add -xt to the command:

PROMPT>  babel mymols.sdf -sN#Cc1ccccc1C#N results.smi -xt

Large datasets

First of all, you need to create a fastsearch index (see above). The index is created with the following command:

PROMPT>  babel mymols.sdf -ofs

Substructure searching is as for small datasets, except that the fastsearch index is used instead of the original file. This command will find all molecules containing 1,2-dicyanobenzene and return the results as SMILES strings:

PROMPT>  babel mymols.fs -ifs -sN#Cc1ccccc1C#N results.smi

If you only want the molecule names, add -xt to the command:

PROMPT>  babel mymols.fs -ifs -sN#Cc1ccccc1C#N results.smi -xt
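A fingerprint index can speed up substructure searching because fingerprints act as a quick screen: if the query fingerprint has a bit set that a database molecule lacks, that molecule cannot contain the query and can be skipped. The Python sketch below shows this screening step on made-up integer fingerprints. It assumes fingerprints with this subset property, and the function names are hypothetical. Molecules that pass the screen are only candidates, so each must still be confirmed by a full substructure match.

```python
def may_contain(query_fp, target_fp):
    """True if every bit set in the query fingerprint is also set in the target."""
    return (query_fp & target_fp) == query_fp

def screen(query_fp, index):
    """Return the names of molecules that pass the fingerprint screen."""
    return [name for name, fp in index.items() if may_contain(query_fp, fp)]

index = {"mol_a": 0b1111, "mol_b": 0b1100, "mol_c": 0b0011}

print(screen(0b0110, index))  # ['mol_a']
print(screen(0b0011, index))  # ['mol_a', 'mol_c']
```

Only the molecules returned by the screen need the slower atom-by-atom match, which is why searching the index is much faster than searching the original file.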