FastSearch Class Reference
Class to search fingerprint index files. More...
#include <openbabel/fingerprint.h>
Public Member Functions | |
| std::string | ReadIndexFile (std::string IndexFilename) |
| std::string | ReadIndex (std::istream *pIndexstream) |
| virtual | ~FastSearch () |
| bool | Find (OBBase *pOb, std::vector< unsigned int > &SeekPositions, unsigned int MaxCandidates) |
| bool | FindMatch (OBBase *pOb, std::vector< unsigned int > &SeekPositions, unsigned int MaxCandidates) |
| bool | FindSimilar (OBBase *pOb, std::multimap< double, unsigned int > &SeekposMap, double MinTani) |
| bool | FindSimilar (OBBase *pOb, std::multimap< double, unsigned int > &SeekposMap, int nCandidates=0) |
| OBFingerprint * | GetFingerprint () const |
| const FptIndexHeader & | GetIndexHeader () const |
Detailed Description
Class to search fingerprint index files.The FastSearch class searches an index to a datafile containing a list of molecules (or other objects) prepared by FastSearchIndexer.
OpenBabel can also search files for molecules containing a substructure specified by a SMARTS string, using OBSmartsPattern or from the command line:
babel datafile.xxx -outfile.yyy -sSMARTS
Note that the index files are not portable. They need to be prepared on the computer that will access them.
Using FastSearch and FastSearchIndexer in a program
The index has two tables:
- an array of fingerprints of the molecules
- an array of the seek positions in the datasource file of all the molecules
To prepare an fastsearch index file:
- Open an ostream to the index file.
- Make a FastSearchIndexer object on the heap or the stack, passing in as parameters:
- the datafile name, the indexfile stream,
- the id of the fingerprint type to be used,
- the number of bits the fingerprint is to be folded down to, If it is to be left unfolded, set fpid to 0 or do not specify it.
- For each molecule, call Add() with its pointer and its position in the datafile.
Currently the std::streampos value is implicitly cast to unsigned int so that for 32bit machines the datafile has to be no longer than about 2Gbyte. - The index file is written when the FastSearchIndexer object is deleted or goes out of scope.
To search in a fastsearch index file
- Open a std::istream to the indexfile (in binary mode on some systems)
- Make a FastSearch object, read the index and open the datafile whose name it provides
ifstream ifs(indexname,ios::binary); FastSearch fs; string datafilename = fs.ReadIndex(&ifs); if(datafilename.empty() return false; ifstream datastream(datafilename); if(!datastream) return false;
To do a search for molecules which have all the substructure bits the OBMol object, patternMol
vector<unsigned int>& SeekPositions;
if(!fs.Find(patternMol, SeekPositions, MaxCandidates))
for(itr=SeekPositions.begin();itr!=SeekPositions.end();++itr)
{
datastream.seekg(*itr);
... read the candidate molecule
and subject to more rigorous test if necessary
}
To do a similarity search based on the Tanimoto coefficient This is defined as:
Number of bits set in (patternFP & targetFP)/Number of bits in (patternFP | targetFP)
where the boolean operations between the fingerprints are bitwise
The Tanimoto coefficient has no absolute meaning and depends on the design of the fingerprint.
multimap<double, unsigned int> SeekposMap;
// to find n best molecules
fs.FindSimilar(&patternMol, SeekposMap, n);
...or
// to find molecules with Tanimoto coefficient > MinTani
fs.FindSimilar(&patternMol, SeekposMap, MinTani);
multimap<double, unsigned int>::reverse_iterator itr;
for(itr=SeekposMap.rbegin();itr!=SeekposMap.rend();++itr)
{
datastream.seekg(itr->second);
// ... read the candidate molecule
double tani = itr->first;
}
The FastSearchFormat class facilitates the use of these routine from the command line or other front end program. For instance:
Prepare an index:
babel datafile.xxx index.fs
babel datafile.xxx -ofs namedindex
- which type of fingerprint to be used, e.g. -xfFP2,
- whether it is folded to a specified number of bits, e.g. -xn128 (which should be a power of2)
- whether to pre-select the molecules which are indexed:
- by structure e.g only ethers and esters, -sCOC
- by excluding molecules with bezene rings, -vc1ccccc1
- by position in the datafile e.g. only mols 10 to 90, -f10 -l90
babel index.fs outfile.yyy -sSMILES
babel datafile.xxx outfile.yyy -sSMILES -ifs
With the -S option, the target can be specified as a molecule in a file of any format
babel datafile.xxx outfile.yyy -Spattern_mol.zzz -ifs
Similarity search in a previously prepared index file
This rather done (rather than a substructure search) if the -at option is used,
babel datafile.xxx outfile.yyy -sSMILES -ifs -at0.7
- -at0.7 will recover all molecules with Tanimoto greater than 0.7
- -at15 (no decimal point) will recover the 15 molecules with largest coefficients.
- -aa will add the Tanimoto coefficient to the titles of the output molecules.
All stages, the indexing, the interpretation of the SMILES string in the -s option, the file in the -S option and the final SMARTS filter convert to OBMol and apply ConvertDativeBonds(). This ensures thatforms such as[N+]([O-])=O and N(=O)=O are recognized as chemically identical.
Constructor & Destructor Documentation
| virtual ~FastSearch | ( | ) | [inline, virtual] |
Member Function Documentation
| string ReadIndexFile | ( | std::string | IndexFilename | ) |
Loads an index from a file and returns the name of the datafile.
| string ReadIndex | ( | std::istream * | pIndexstream | ) |
Referenced by FastSearch::ReadIndexFile().
| bool Find | ( | OBBase * | pOb, | |
| std::vector< unsigned int > & | SeekPositions, | |||
| unsigned int | MaxCandidates | |||
| ) |
Does substructure search and returns vector of the file positions of matches.
Finds chemical objects in datafilename (which must previously have been indexed) that have all the same bits set in their fingerprints as in the fingerprint of a pattern object. (Usually a substructure search in molecules.) The type of fingerprint and its degree of folding does not have to be specified here because the values in the index file are used. The positions of the candidate matching molecules in the original datafile are returned.
| bool FindMatch | ( | OBBase * | pOb, | |
| std::vector< unsigned int > & | SeekPositions, | |||
| unsigned int | MaxCandidates | |||
| ) |
| bool FindSimilar | ( | OBBase * | pOb, | |
| std::multimap< double, unsigned int > & | SeekposMap, | |||
| double | MinTani | |||
| ) |
- Returns:
- A multimap containing objects whose Tanimoto coefficients with the target is greater than the value specified.
| bool FindSimilar | ( | OBBase * | pOb, | |
| std::multimap< double, unsigned int > & | SeekposMap, | |||
| int | nCandidates = 0 | |||
| ) |
- Returns:
- A multimap containing the nCandidates objects with largest Tanimoto coefficients with the target.
If nCandidates is zero or omitted the original size of the multimap is used
| OBFingerprint* GetFingerprint | ( | ) | const [inline] |
- Returns:
- a pointer to the fingerprint type used to constuct the index
| const FptIndexHeader& GetIndexHeader | ( | ) | const [inline] |
- Returns:
- a pointer to the index header containing size info etc.
The documentation for this class was generated from the following files:

