The base class for fingerprints.
These fingerprints are a condensed representation of molecules (or other objects) as a list of boolean values (actually bits in a vector<unsigned>) whose length is a power of 2. The main motivation is fast searching of data sources containing large numbers of molecules (up to several million). Open Babel provides routines that can search text files containing lists of molecules in any format; see the documentation for the class FastSearch.
There are descriptions of molecular fingerprints at
http://www.daylight.com/dayhtml/doc/theory/theory.finger.html and
http://www.mesaac.com/Fingerprint.htm
Many methods of preparing fingerprints have been described, but the type supported currently in OpenBabel has each bit representing a substructure (or other molecular property). If a substructure is present in the molecule, then a particular bit is set to 1. But because the hashing method may also map other substructures to the same bit, a match does not guarantee that a particular substructure is present; there may be false positives. However, with proper design, a large fraction of irrelevant molecules in a data set can be eliminated in a fast search with boolean methods on the fingerprints. It then becomes feasible to make a definitive substructure search by conventional methods on this reduced list even if it is slow.
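The screening step described above can be illustrated with a minimal sketch. It is not the Open Babel implementation: setBit, mayContain and kBitsPerWord are hypothetical names chosen for this example, and the fingerprints are assumed to be equal-length vectors of unsigned int.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Number of bits held by one word of the fingerprint vector.
const std::size_t kBitsPerWord = sizeof(unsigned int) * CHAR_BIT;

// Set bit n of the fingerprint (hypothetical helper, not the Open Babel API).
void setBit(std::vector<unsigned int>& fp, std::size_t n)
{
    assert(n / kBitsPerWord < fp.size());
    fp[n / kBitsPerWord] |= 1u << (n % kBitsPerWord);
}

// Return true if every bit set in query is also set in target.
// true means "possibly contains the query" (false positives can occur);
// false means the target can be discarded without a full search.
bool mayContain(const std::vector<unsigned int>& target,
                const std::vector<unsigned int>& query)
{
    assert(target.size() == query.size());
    for (std::size_t i = 0; i < query.size(); ++i)
        if ((query[i] & ~target[i]) != 0u)
            return false;
    return true;
}
```

Only the molecules for which mayContain returns true need to be passed on to the slower, definitive substructure search.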
OpenBabel provides a framework for applying new types of fingerprints without changing any existing code. They are derived from OBFingerprint and the source file is just compiled with the rest of OpenBabel. Alternatively, they can be separately compiled as a DLL or shared library and discovered when OpenBabel runs.
For more on these specific implementations of fingerprints in Open Babel, please take a look at the developer's wiki: http://openbabel.org/wiki/Fingerprints
Fingerprints derived from this abstract base class OBFingerprint can be for any object derived from OBBase (not just for OBMol). Each derived class provides an ID as a string, and OBFingerprint keeps a map of these IDs so that it can provide a pointer to the class when requested in FindFingerprint.
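The ID-to-instance lookup can be pictured with the following sketch of a self-registering base class. It only illustrates the general pattern; FingerprintBase, DemoFp, Find and registry are hypothetical names and do not reproduce the OBFingerprint source.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for a base class that records its instances by ID.
class FingerprintBase
{
public:
    // Constructing an instance stores a pointer to it under the given ID.
    explicit FingerprintBase(const std::string& id) { registry()[id] = this; }
    virtual ~FingerprintBase() {}
    virtual const char* Description() const = 0;

    // Look up a registered instance by ID; returns nullptr if the ID is unknown.
    static FingerprintBase* Find(const std::string& id)
    {
        auto it = registry().find(id);
        return it == registry().end() ? nullptr : it->second;
    }

private:
    // A function-local static guarantees the map exists before any
    // global instance tries to register itself in it.
    static std::map<std::string, FingerprintBase*>& registry()
    {
        static std::map<std::string, FingerprintBase*> m;
        return m;
    }
};

class DemoFp : public FingerprintBase
{
public:
    explicit DemoFp(const std::string& id) : FingerprintBase(id) {}
    const char* Description() const override { return "Demo fingerprint"; }
};

// Global instance: constructing it registers the ID "demo".
DemoFp theDemoFp("demo");
```

Because registration happens in the base class constructor, adding a new type requires no change to existing code, only a new derived class and one global instance.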
– To define a fingerprint type –
Classes derived from OBFingerprint are required to provide a GetFingerprint() routine and a Description() routine:
class MyFpType : public OBFingerprint
{
public:
  MyFpType(const char* id) : OBFingerprint(id) {}

  virtual bool GetFingerprint(OBBase* pOb, vector<unsigned int>& fp,
                              int nbits)
  {
    // Convert pOb to the required type, usually OBMol
    OBMol* pmol = dynamic_cast<OBMol*>(pOb);
    if(!pmol)
      return false;
    fp.resize(required_number_of_words);
    ...
    // use SetBit(fp,n); to set the nth bit
    if(nbits)
      Fold(fp, nbits); // reduce to nbits when a smaller size is requested
    return true;
  }

  virtual const char* Description()
  { return "Some descriptive text"; }
  ...
};
Declare a global instance with the ID you will use in -f options to specify its use.
MyFpType theMyFpType("myfpID");
– To obtain a fingerprint –
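OBMol mol;
...
vector<unsigned int> fp;
OBFingerprint::GetDefault()->GetFingerprint(&mol, fp); // default fingerprint size
or
vector<unsigned int> fp;
OBFingerprint* pFP = OBFingerprint::FindFingerprint("myfpID");
...and maybe...
pFP->GetFingerprint(&mol, fp, 128); // fold down to 128 bits if originally larger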
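The last call above passes 128 to request a fingerprint of that many bits. A minimal sketch of how a longer fingerprint can be reduced, assuming the reduction is done by repeatedly OR-ing the upper half of the vector onto the lower half; foldTo and kBitsPerWord are hypothetical names for this illustration, not the library routine.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Number of bits held by one word of the fingerprint vector.
const std::size_t kBitsPerWord = sizeof(unsigned int) * CHAR_BIT;

// Halve the fingerprint until halving again would drop below nbits.
// Each pass ORs word i of the upper half into word i of the lower half,
// so every bit that was set is still represented after folding.
void foldTo(std::vector<unsigned int>& fp, std::size_t nbits)
{
    assert(nbits >= kBitsPerWord);
    while (fp.size() % 2 == 0 && (fp.size() / 2) * kBitsPerWord >= nbits) {
        const std::size_t half = fp.size() / 2;
        for (std::size_t i = 0; i < half; ++i)
            fp[i] |= fp[half + i];
        fp.resize(half);
    }
}
```

Folding increases the chance that unrelated substructures share a bit, so a smaller fingerprint trades screening power for storage and speed.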
– To print a list of available fingerprint types –
std::string id;
OBFingerprint* pFPrt=NULL;
while(OBFingerprint::GetNextFPrt(id, pFPrt))
{
cout << id << " -- " << pFPrt->Description() << endl;
}
Fingerprints are handled as vector<unsigned int>, so the number of bits in this vector and their order are platform and compiler dependent, because of differences in the size of int types and in endianness. Use fingerprints (and fastsearch indexes containing them) only for comparing with other fingerprints prepared on the same machine.
The FingerprintFormat class is an output format which displays fingerprints as hexadecimal. When multiple molecules are supplied, it calculates the Tanimoto coefficient from the first molecule to each of the others. It also shows whether the first molecule is a possible substructure of each of the others, i.e. whether all the bits set in the fingerprint of the first molecule are also set in the fingerprint of the other. To display hexadecimal information when multiple molecules are provided, it is necessary to use the -xh option.
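For reference, the Tanimoto coefficient of two bit fingerprints is the number of bits set in both divided by the number of bits set in either. The sketch below shows that calculation on two equal-length vectors; tanimoto and countSetBits are hypothetical helper names, and returning 0 when neither fingerprint has any bit set is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Count the set bits in one word; each pass clears the lowest set bit.
unsigned countSetBits(unsigned int w)
{
    unsigned count = 0;
    while (w != 0u) {
        w &= w - 1u;
        ++count;
    }
    return count;
}

// Tanimoto coefficient: (bits set in both) / (bits set in either).
double tanimoto(const std::vector<unsigned int>& a,
                const std::vector<unsigned int>& b)
{
    assert(a.size() == b.size());
    unsigned long both = 0, either = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        both += countSetBits(a[i] & b[i]);
        either += countSetBits(a[i] | b[i]);
    }
    // Assumption: two empty fingerprints are given a similarity of 0.
    return either == 0 ? 0.0 : static_cast<double>(both) / either;
}
```

Identical non-empty fingerprints give 1, and fingerprints with no bits in common give 0.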
To see a list of available fingerprint types, type obabel -F on the command line. The -xF option of the FingerprintFormat class also provides this output, but due to a quirk in the way the program works, a valid input molecule must be supplied for this option to work.