Open Babel  3.0
Public Member Functions | List of all members
FastSearch Class Reference

#include <openbabel/fingerprint.h>

Public Member Functions

std::string ReadIndexFile (std::string IndexFilename)
 
std::string ReadIndex (std::istream *pIndexstream)
 
virtual ~FastSearch ()
 
bool Find (OBBase *pOb, std::vector< unsigned long > &SeekPositions, unsigned int MaxCandidates)
 
bool FindMatch (OBBase *pOb, std::vector< unsigned long > &SeekPositions, unsigned int MaxCandidates)
 
bool FindSimilar (OBBase *pOb, std::multimap< double, unsigned long > &SeekposMap, double MinTani, double MaxTani=1.1)
 
bool FindSimilar (OBBase *pOb, std::multimap< double, unsigned long > &SeekposMap, int nCandidates=0)
 
OBFingerprintGetFingerprint () const
 
const FptIndexHeaderGetIndexHeader () const
 

Detailed Description

Class to search fingerprint index files.

The FastSearch class searches an index to a datafile containing a list of molecules (or other objects) prepared by FastSearchIndexer.

OpenBabel can also search files for molecules containing a substructure specified by a SMARTS string, using OBSmartsPattern or from the command line:

obabel datafile.xxx -O outfile.yyy -sSMARTS

But when the data file contains more than about 10,000 molecules this becomes rather too slow to be used interactively. To do it more quickly, an index of the molecules containing their fingerprints (see OBFingerprint) is prepared using FastSearchIndexer. The indexing process may take a long time but only has to be done once. The index can be searched very quickly with FastSearch. Because of the nature of fingerprints a match to a bit does not guarantee the presence of a particular substructure or other molecular property, so that a definitive answer may require a subsequent search of the (much reduced) list of candidates by another method (like OBSmartsPattern).

Note that the index files are not portable. They need to be prepared on the computer that will access them.

Using FastSearch and FastSearchIndexer in a program

The index has two tables:

To prepare an fastsearch index file:

To search in a fastsearch index file

To do a search for molecules which have all the substructure bits the OBMol object, patternMol

vector<unsigned int>& SeekPositions;
if(!fs.Find(patternMol, SeekPositions, MaxCandidates))
for(itr=SeekPositions.begin();itr!=SeekPositions.end();++itr)
{
datastream.seekg(*itr);
... read the candidate molecule
and subject to more rigorous test if necessary
}

To do a similarity search based on the Tanimoto coefficient This is defined as:
Number of bits set in (patternFP & targetFP)/Number of bits in (patternFP | targetFP)
where the boolean operations between the fingerprints are bitwise
The Tanimoto coefficient has no absolute meaning and depends on the design of the fingerprint.

multimap<double, unsigned int> SeekposMap;
// to find n best molecules
fs.FindSimilar(&patternMol, SeekposMap, n);
...or
// to find molecules with Tanimoto coefficient > MinTani
fs.FindSimilar(&patternMol, SeekposMap, MinTani);
multimap<double, unsigned int>::reverse_iterator itr;
for(itr=SeekposMap.rbegin();itr!=SeekposMap.rend();++itr)
{
datastream.seekg(itr->second);
// ... read the candidate molecule
double tani = itr->first;
}

The FastSearchFormat class facilitates the use of these routine from the command line or other front end program. For instance:

Prepare an index:

obabel datafile.xxx -O index.fs

With options you can specify:

Substructure search in a previously prepared index file

obabel index.fs -O outfile.yyy -sSMILES

The datafile can also be used as the input file, provided the input format is specified as fs

obabel datafile.xxx -O outfile.yyy -sSMILES -ifs

A "slow" search not using fingerprints would be done on the same data by omitting -ifs. A "slow" search can use SMARTS strings, but the fast search is restricted to the subset which is valid SMILES.

With the -S option, the target can be specified as a molecule in a file of any format

obabel datafile.xxx -O outfile.yyy -Spattern_mol.zzz -ifs

These searches have two stages: a fingerprint search which produces a number of candidate molecules and a definitive search which selects from these using SMARTS pattern matching. The maximum number of candidates is 4000 by default but you can change this with an option e.g. -al 8000 (Note that you need the space between l and the number.) If the fingerprint search reaches the maximum number it will not look further and will tell you at what stage it stopped.

Similarity search in a previously prepared index file
This rather done (rather than a substructure search) if the -at option is used,

obabel datafile.xxx -O outfile.yyy -sSMILES -ifs -at0.7

for instance

All stages, the indexing, the interpretation of the SMILES string in the -s option, the file in the -S option and the final SMARTS filter convert to OBMol and apply ConvertDativeBonds(). This ensures thatforms such as[N+]([O-])=O and N(=O)=O are recognized as chemically identical.

Constructor & Destructor Documentation

◆ ~FastSearch()

virtual ~FastSearch ( )
inlinevirtual

Member Function Documentation

◆ ReadIndexFile()

string ReadIndexFile ( std::string  IndexFilename)

Loads an index from a file and returns the name of the datafile.

◆ ReadIndex()

string ReadIndex ( std::istream *  pIndexstream)

◆ Find()

bool Find ( OBBase pOb,
std::vector< unsigned long > &  SeekPositions,
unsigned int  MaxCandidates 
)

Does substructure search and returns vector of the file positions of matches.

Finds chemical objects in datafilename (which must previously have been indexed) that have all the same bits set in their fingerprints as in the fingerprint of a pattern object. (Usually a substructure search in molecules.) The type of fingerprint and its degree of folding does not have to be specified here because the values in the index file are used. The positions of the candidate matching molecules in the original datafile are returned.

◆ FindMatch()

bool FindMatch ( OBBase pOb,
std::vector< unsigned long > &  SeekPositions,
unsigned int  MaxCandidates 
)

Similar to Find() but all bits of matching fingerprints have to be the same.

Since
version 2.1

◆ FindSimilar() [1/2]

bool FindSimilar ( OBBase pOb,
std::multimap< double, unsigned long > &  SeekposMap,
double  MinTani,
double  MaxTani = 1.1 
)
Returns
A multimap containing objects whose Tanimoto coefficients with the target is greater than the value specified.

◆ FindSimilar() [2/2]

bool FindSimilar ( OBBase pOb,
std::multimap< double, unsigned long > &  SeekposMap,
int  nCandidates = 0 
)
Returns
A multimap containing the nCandidates objects with largest Tanimoto coefficients with the target.

If nCandidates is zero or omitted the original size of the multimap is used

◆ GetFingerprint()

OBFingerprint* GetFingerprint ( ) const
inline
Returns
a pointer to the fingerprint type used to constuct the index

◆ GetIndexHeader()

const FptIndexHeader& GetIndexHeader ( ) const
inline
Returns
a pointer to the index header containing size info etc.

The documentation for this class was generated from the following files: