Tutorial:Fingerprints

From Open Babel
Revision as of 05:08, 11 July 2008 by Baoilleach (Talk | contribs) (Category fingerprints)

Molecular "fingerprints" encode molecular information, such as ring types, functional groups, and other molecular and atomic features, as a series of bits. Comparing fingerprints lets you determine the similarity between two molecules, search databases, and so on, but a fingerprint does not contain full structural data such as coordinates.

You can see the available fingerprints by typing the following command:

PROMPT> babel -F
FP2 -- Indexes linear fragments up to 7 atoms.
FP3 -- SMARTS patterns specified in the file patterns.txt
FP4 -- SMARTS patterns specified in the file SMARTS_InteLigand.txt

At present there are three types of fingerprints: FP2, which indexes small linear fragments of the molecule, and FP3 and FP4, which both use a series of SMARTS queries stored in patterns.txt and SMARTS_InteLigand.txt, respectively. (You can add your own SMARTS queries to these files.) On UNIX and Mac systems, these files are frequently found in /usr/local/share/openbabel, under a directory for each version of Open Babel.
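To make the idea of "indexing linear fragments" concrete, here is a minimal Python sketch of a path-based fingerprint. It is a toy under stated assumptions, not Open Babel's FP2 code: the 1024-bit length, the CRC32 hash, and the function names are illustrative choices, and bond orders and ring closures are ignored.

```python
import zlib

def linear_paths(atoms, bonds, max_atoms=7):
    """Enumerate simple linear paths of 1..max_atoms atoms as tuples of atom indices."""
    neighbours = {i: set() for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].add(b)
        neighbours[b].add(a)

    paths = []

    def extend(path):
        paths.append(tuple(path))
        if len(path) == max_atoms:
            return
        for nxt in sorted(neighbours[path[-1]]):
            if nxt not in path:  # never revisit an atom within one path
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return paths

def toy_fingerprint(atoms, bonds, n_bits=1024, max_atoms=7):
    """Return the set of bit positions switched on by the molecule's linear fragments."""
    bits = set()
    for path in linear_paths(atoms, bonds, max_atoms):
        symbols = [atoms[i] for i in path]
        # A path read forwards or backwards is the same fragment: pick one canonical form.
        key = "-".join(min(symbols, symbols[::-1]))
        # Hash the fragment to a bit position (assumed hash, for illustration only).
        bits.add(zlib.crc32(key.encode("ascii")) % n_bits)
    return bits

# Heavy atoms of ethanol: C-C-O
ethanol = toy_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
print(sorted(ethanol))
```

Because different fragments can hash to the same bit, a fingerprint of this kind is a lossy summary of the structure, which is why it cannot replace the full structural data.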

For relatively small datasets (up to tens of thousands of molecules), it is possible to do similarity searches without building a similarity index. Larger datasets (up to hundreds of thousands of molecules) can be searched rapidly once a fastsearch index has been built.

On small datasets, these fingerprints can be used in a variety of ways. For example, the command:

PROMPT>  babel  'mymols.sdf'  -ofpt
MOL_00000067
MOL_00000083   Tanimoto from MOL_00000067 = 0.810811
MOL_00000105   Tanimoto from MOL_00000067 = 0.833333
MOL_00000296   Tanimoto from MOL_00000067 = 0.425926
MOL_00000320   Tanimoto from MOL_00000067 = 0.534884
MOL_00000328   Tanimoto from MOL_00000067 = 0.511111
MOL_00000338   Tanimoto from MOL_00000067 = 0.522727
MOL_00000354   Tanimoto from MOL_00000067 = 0.534884
MOL_00000378   Tanimoto from MOL_00000067 = 0.489362
MOL_00000391   Tanimoto from MOL_00000067 = 0.489362
10 molecules converted

gives you the Tanimoto coefficient between the first molecule in mymols.sdf and each of the subsequent ones. The structures do not have to be in the same file, or even in the same format. For example, the following command gives you the Tanimoto coefficient between a SMILES string in mysmiles.smi and all the molecules in mymols.sdf:

PROMPT>  babel  'mysmiles.smi'  'mymols.sdf' -ofpt
MOL_00000067   Tanimoto from first mol = 0.0888889
MOL_00000083   Tanimoto from first mol = 0.0869565
MOL_00000105   Tanimoto from first mol = 0.0888889
MOL_00000296   Tanimoto from first mol = 0.0714286
MOL_00000320   Tanimoto from first mol = 0.0888889
MOL_00000328   Tanimoto from first mol = 0.0851064
MOL_00000338   Tanimoto from first mol = 0.0869565
MOL_00000354   Tanimoto from first mol = 0.0888889
MOL_00000378   Tanimoto from first mol = 0.0816327
MOL_00000391   Tanimoto from first mol = 0.0816327
11 molecules converted
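The Tanimoto coefficient reported in these listings is the number of bits set in both fingerprints divided by the number of bits set in either. A minimal Python sketch, assuming each fingerprint is represented as a set of "on" bit positions (the bit values below are made up for illustration and do not come from Open Babel):

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)

# Hypothetical fingerprints: 3 shared bits, 6 distinct bits in total.
fp_first = {1, 4, 9, 16, 25}
fp_other = {1, 4, 9, 36}
print(f"Tanimoto = {tanimoto(fp_first, fp_other):.6f}")  # Tanimoto = 0.500000
```

A value of 1 means the two fingerprints are identical, and a value near 0 means they share almost no bits.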

If you want to know the similarity between only the substituted bromobenzenes in mymols.sdf, you can combine options like this:

PROMPT>  babel 'mymols.sdf' -ofpt -s 'c1ccccc1Br'
MOL_00000067
MOL_00000083   Tanimoto from MOL_00000067 = 0.810811
MOL_00000105   Tanimoto from MOL_00000067 = 0.833333
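Listings in this format are straightforward to post-process. The following Python sketch parses text shaped like the output shown above into (name, score) pairs; the regular expression is an assumption based only on the sample lines in this tutorial, so adjust it if your version prints something different.

```python
import re

# Matches lines such as: "MOL_00000083   Tanimoto from MOL_00000067 = 0.810811"
LINE = re.compile(r"^(?P<name>\S+)\s+Tanimoto from (?P<ref>.+?) = (?P<score>[0-9.]+)\s*$")

def parse_fpt(text):
    """Return (molecule name, Tanimoto score) pairs; lines without a score are skipped."""
    results = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            results.append((match.group("name"), float(match.group("score"))))
    return results

sample = """\
MOL_00000067
MOL_00000083   Tanimoto from MOL_00000067 = 0.810811
MOL_00000105   Tanimoto from MOL_00000067 = 0.833333
"""
scores = parse_fpt(sample)
for name, score in scores:
    print(name, score)
```

The reference molecule (the first line of the listing) has no score and is therefore not returned.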

To use a different fingerprint type, add the -xf option followed by the fingerprint name:

PROMPT>  babel 'mymols.sdf' -ofpt -xfFP3

On larger datasets it is necessary to build a fastsearch index first, using the command:

PROMPT>  babel mymols.sdf -ofs

This builds mymols.fs using the default fingerprint (unfolded). To use the index to find the five best matches to the molecule in target.sdf:

PROMPT>  babel mymols.fs results.sdf -Starget.sdf -at5

Or, to get the matches to 1,2-dicyanobenzene with a Tanimoto coefficient greater than 0.6:

PROMPT>  babel mymols.fs results.sdf -sN#Cc1ccccc1C#N -at0.6
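In the two commands above, -at5 keeps the five best matches and -at0.6 keeps the matches above a Tanimoto cutoff. The Python sketch below illustrates those two selection rules on the scores from the first listing in this tutorial. The function names are hypothetical, and this shows only the selection logic, not how the fastsearch index computes or retrieves the scores.

```python
def top_n(scores, n):
    """Return the n highest-scoring (name, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]

def above_cutoff(scores, cutoff):
    """Return (name, score) pairs with score strictly greater than cutoff, best first."""
    hits = [(name, score) for name, score in scores.items() if score > cutoff]
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Tanimoto scores against MOL_00000067, copied from the listing earlier in the tutorial.
scores_vs_first = {
    "MOL_00000083": 0.810811,
    "MOL_00000105": 0.833333,
    "MOL_00000296": 0.425926,
    "MOL_00000320": 0.534884,
    "MOL_00000328": 0.511111,
    "MOL_00000338": 0.522727,
    "MOL_00000354": 0.534884,
    "MOL_00000378": 0.489362,
    "MOL_00000391": 0.489362,
}

print(top_n(scores_vs_first, 5))
print(above_cutoff(scores_vs_first, 0.6))
```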

Fastsearch Indexing

You can also do substructure searching by creating a "fastsearch" index. This is a new file that stores a database of fingerprints for the files indexed. You still need to keep both the new .fs fastsearch index and the original files. However, the index allows significantly faster searching and similarity comparisons.

This command will find all molecules containing 1,2-dicyanobenzene and return the results as SMILES strings:

PROMPT>  babel mymols.fs -ifs -sN#Cc1ccccc1C#N results.smi

If all you want in the output is the molecule names, adding -xt returns just the names:

PROMPT>  babel mymols.fs -ifs -sN#Cc1ccccc1C#N results.smi -xt
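A fingerprint index speeds up substructure searching because fingerprints can act as a quick screen: a molecule can only contain the query if every bit set in the query fingerprint is also set in the molecule's fingerprint. Survivors of the screen still need a full structural check, which is one reason the original files must be kept alongside the index. The Python sketch below illustrates this general screening idea; the index contents and the exact_match callback are hypothetical stand-ins, not Open Babel's data format or API.

```python
def passes_screen(target_bits, query_bits):
    """A target can contain the query only if it has every bit the query has."""
    return query_bits <= target_bits

def substructure_search(index, query_bits, exact_match):
    """Screen with fingerprints, then confirm each candidate with a full match.

    index       -- dict mapping molecule name to its set of 'on' bits
    exact_match -- callable(name) -> bool, standing in for a real substructure
                   match against the original structure file (hypothetical)
    """
    candidates = [name for name, bits in index.items() if passes_screen(bits, query_bits)]
    return [name for name in candidates if exact_match(name)]

# Toy data: mol_c passes the screen but fails the full match (a false positive).
toy_index = {
    "mol_a": {1, 2, 3, 4},
    "mol_b": {1, 2},
    "mol_c": {1, 2, 3, 9},
}
query = {1, 2, 3}
truly_matching = {"mol_a"}

hits = substructure_search(toy_index, query, lambda name: name in truly_matching)
print(hits)  # ['mol_a']
```

The screen never rejects a true match, so the expensive structural comparison runs only on the small set of candidates that pass it.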