Cheminformatics 101

From Open Babel
Revision as of 09:40, 2 November 2006 by Ghutchis (Talk | contribs) (Import of Craig's excellent primer)


An introduction to the Computer Science and Chemistry of Chemical Information Systems

by Craig A. James, eMolecules, Inc.


  1. Cheminformatics Basics
  2. Representing Molecules
  3. Substructure Searching with Indexes
  4. Molecular Similarity
  5. Chemical Registration Systems

1. Cheminformatics Basics

What is Cheminformatics?

Cheminformatics is a cross between Computer Science and Chemistry: the process of storing and retrieving information about chemical compounds.

Information Systems are concerned with storing, retrieving, and searching information, and with storing relationships between bits of data. For example:

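Operation: Store
  Classical Information System: Name = 'Jimmy Carter'. Stores text, numbers, dates, ...
  Chemical Information System: [image: Steroid2.gif] Stores chemical compounds and information about them.

Operation: Retrieve
  Classical Information System: Find record #13282. Retrieves 'Jimmy Carter'.
  Chemical Information System: Find: [structure image]

Operation: Search
  Classical Information System: Find Presidents named 'Bush'. Finds George Bush and George W. Bush.
  Chemical Information System: Find molecules containing: [structure image]. Match: [image: Steroid2 matched.gif]

Operation: Relationship
  Classical Information System: Year Carter was elected. Answer: Elected in 1976.
  Chemical Information System: What's the logP(o/w) of: [structure image]. logP(o/w) = 2.62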

How is Cheminformatics Different?

There are four key problems a cheminformatics system solves:

1. Store a molecule: Computer scientists usually use the valence model of chemistry to represent compounds. Section 2, Representing Molecules, discusses this at length.
2. Find exact molecule: If you ask, "Is Abraham Lincoln in the database?" it's not hard to find the answer. But, given a specific molecule, is it in the database? What do we know about it? This may seem simple at first glance, but it's not, as we'll see when we discuss tautomers, stereochemistry, metals, and other "flaws" in the valence model of chemistry.
3. Substructure search: If you ask, "Is anyone named Lincoln in the database?" you usually expect to find the former President and a number of others. This is called a search rather than a lookup. A cheminformatics system has the substructure search: find all molecules containing a partial molecule (the "substructure") drawn by the user. The substructure is usually a functional group, "scaffold", or core structure representing a class of molecules. This too is a hard problem, much harder than most text searches, for reasons that go to the very root of mathematics and the theory of computability.
4. Similarity search: Some databases can find similar-sounding or misspelled words, such as "Find Lincon" or "find Cincinati", which respectively might find Abraham Lincoln and Cincinnati. Many chemical information systems can find molecules similar to a given molecule, ranked by similarity. There are several ways to measure molecular similarity, discussed further in Section 4, Molecular Similarity.
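
The four problems above can be sketched on a toy scale. The Python below is an illustration under stated assumptions, not how a production system works. Molecules are stored as bare graphs (element symbols plus bonded index pairs, with hydrogens and bond orders left out). Exact lookup and substructure search use a brute-force atom mapping that is only practical for very small molecules. Similarity uses the Tanimoto coefficient over a made-up feature set, which is one common choice that this section does not itself name. All function and variable names (DATABASE, contains, is_same, features, tanimoto, and the three search functions) are hypothetical.

```python
from itertools import permutations

# 1. Store a molecule: atoms are element symbols, bonds are index pairs.
# Assumption: each bond is listed once, and hydrogens/bond orders are omitted.
ETHANOL = (["C", "C", "O"], {(0, 1), (1, 2)})
DIMETHYL_ETHER = (["C", "O", "C"], {(0, 1), (1, 2)})
PROPANE = (["C", "C", "C"], {(0, 1), (1, 2)})
DATABASE = {"ethanol": ETHANOL, "dimethyl ether": DIMETHYL_ETHER, "propane": PROPANE}


def has_bond(bonds, i, j):
    """True if atoms i and j are bonded, in either listed order."""
    return (i, j) in bonds or (j, i) in bonds


def contains(molecule, query):
    """Brute-force substructure test: try every assignment of query atoms
    to distinct molecule atoms and check that elements and bonds agree."""
    atoms, bonds = molecule
    q_atoms, q_bonds = query
    for mapping in permutations(range(len(atoms)), len(q_atoms)):
        if any(atoms[m] != q for m, q in zip(mapping, q_atoms)):
            continue
        if all(has_bond(bonds, mapping[i], mapping[j]) for i, j in q_bonds):
            return True
    return False


def is_same(molecule, query):
    """Exact match: equal atom and bond counts plus a substructure match
    means the two graphs are identical up to atom ordering."""
    return (len(molecule[0]) == len(query[0])
            and len(molecule[1]) == len(query[1])
            and contains(molecule, query))


def features(molecule):
    """Hypothetical feature set: elements present and bonded element pairs."""
    atoms, bonds = molecule
    found = set(atoms)
    for i, j in bonds:
        found.add("-".join(sorted((atoms[i], atoms[j]))))
    return found


def tanimoto(a, b):
    """Tanimoto coefficient: shared features divided by all features."""
    fa, fb = features(a), features(b)
    return len(fa & fb) / len(fa | fb)


def find_exact(query):
    """2. Find exact molecule, regardless of how the atoms are numbered."""
    return [name for name, mol in DATABASE.items() if is_same(mol, query)]


def substructure_search(query):
    """3. Substructure search: every molecule containing the query."""
    return [name for name, mol in DATABASE.items() if contains(mol, query)]


def similarity_search(query):
    """4. Similarity search: all molecules, most similar first."""
    scored = [(tanimoto(mol, query), name) for name, mol in DATABASE.items()]
    return [name for score, name in sorted(scored, key=lambda s: -s[0])]


# Ethanol drawn with a different atom numbering still matches only ethanol.
print(find_exact((["O", "C", "C"], {(0, 1), (1, 2)})))  # ['ethanol']
# A C-O fragment is found in both oxygen-containing molecules.
print(substructure_search((["C", "O"], {(0, 1)})))  # ['ethanol', 'dimethyl ether']
print(similarity_search(ETHANOL))  # ['ethanol', 'dimethyl ether', 'propane']
```

Even in this toy, exact lookup cannot be a simple string comparison, because the same molecule can be written with its atoms in any order. The brute-force mapping also grows factorially with molecule size, which hints at why substructure search is far harder than a text search.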

Next: Chapter 2: Representing Molecules