Canonical Coding Algorithm
The aim of a canonical coding algorithm is to assign unique labels to atoms regardless of their input order. In practise, these labels or orders are atom indexes resulting from the order in which the atoms are defined in an input file. Although most chemical file formats could be used to store canonical ordered molecules, canonical single line notations are used more often since they allow two canonical molecules to be compared using a string comparison. Two well known examples are canonical smiles and inchi. While the canonical smiles for the same molecule should always be the same when using a specific implementation (i.e. toolkits), these canonical smiles are usually not transferable between implementations. There is only one inchi implementation and the canonical code, although not formally specified is always the same. A typical use case for canonical codes is to determine if a molecule is already in a database.
The rest of this page documents the OpenBabel canonical coding algorithm.
The starting point of the canonical coding algorithm is the topological symmetry or graph symmetry. This symmetry can be expressed by assigning symmetry classes to atoms. Symmetric atoms have the same symmetry class. The term color is often used in graph theory as a synonym for symmetry classes.
Symmetry classes are used as graph invariants to order atoms. The symmetry classes can be used in two ways:
- lowest symmetry class = lowest label (i.e. 1)
- highest symmetry class = lowest label
Both methods are valid. OpenBabel uses the second method. In the following sections there are more of these decisions that have to be made. The first, choice is used as example to illustrate that it doesn't make any real difference as long as the same choice is used every time. When trying to reproduce OpenBabel canonical smiles the same choices or conventions have to be used.
The used initial graph invariant is specified below and the iteration algorithm is similar to the original Morgan algorithm. The OBGraphSym class should be consulted for a detailed description.
0 1 2 3 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 GI = D D D D D D D D D D D V V V V A R N N N N N N N B B B B C C C C GI: Graph invariant (32 bits) D: Graph theoretical distance (10 bits) V: Heavy valence (4 bits) A: Aromaticity (1 bit) R: Ring atom (1 bit) N: Atomic number (7 bits) B: Heavy bond sum (4 bits) C: Formal charge + 7 (4 bits)
(FIXME: ugly smiles)
The symmetry classes encode various attributes for each atom. Consult the OBGraphSym documentation for details. (note: when trying to reproduce OpenBabel canonical smiles, make sure these graph invariants are the same)
The labeling starts by selecting the initial atom using a set of rules to ensure it starts at terminal atoms. (FIXME) Label 1 is assigned to this atom and the labeling continues using this atom as the current atom:
- Find the neighbor atoms of the current atom to label next. Already labeled and atoms not in the fragment are ignored. If there are no such atoms, go to 3.
- Sort the neighbors by their symmetry classes. Each group of atoms with the same symmetry class is now handled in sequence. If there is only one atom in the group, this atom is labeled and the process continues with the next group. If there are multiple (n) atoms in a group, all n! permutations of the group are tried. (note: The Optimization section below mentions some exceptions which affect the final result. These optimizations are part of the algorithm)
- When all neighbor atoms of the current atom are labeled, the next atom (i.e. the atom with current_label+1 label) is considered the current atom and step 1 through 3 are repeated. If there is no next atom, the labeling is done.
The algorithm just described results in a set of labellings that are candidates for the canonical labeling. For each labeling, a code is generated to select the canonical code and the associated labels.
A canonical code can be anything encoding the molecular graph with all the properties that should be canonicalized. The code should be invariant to the original atom order. At this point, it may seem that this is similar to the graph invariants from above (i.e. symmetry classes). However, generating these codes allows for additional attributes to be included. A simple example are isotopes. More importantly, stereochemistry is included in these codes which is an attribute that does not depend an isolated atom.
The actual code is a list of numbers (unsigned int) constructed by joining smaller codes. These smaller codes usually encode specific attributes.
When labeling the molecule as described earlier, all but the initial atom in a connected fragment have a "from" atom (i.e. current atom in the algorithm). The FROM code contains the label of the "from" atom, for each atom (ignoring the initial atom) ordered by their assigned label. The FROM code is a spanning tree containing all bonds that are not ring closures.
The ring closure bonds are easily identified while creating the FROM code. The CLOSURE code contains two labels for each closure bond. The labels are sorted in the following way:
- [1 6] < [6 1] (i.e. 1 6 is used, individual bond sorting)
- [1 3][1 4] < [1 4][1 3] (i.e. 1 3 1 4 is used, sorting bonds)
- [1 3][5 7] < [5 7][1 3]
The square brackets are only to indicate individual bonds, the code is just a single list.
The ATOM-TYPES code is simply the atomic number for each atom ordered by labels. If the molecule contains isotopes, an additional ISOTOPES code is used to correctly canonicalize these molecules.
The BOND-TYPES code contains, for each bond, the bond order or 5 if the bond is aromatic. The bonds are ordered using the order obtained during the FROM code construction. The closure bonds are last, ordered as described in the CLOSURE code section.
The STEREO code contains a descriptor for each stereo center. The stereo centers are ordered using the central atom(s) (i.e. center for tetrahedral stereochemistry, lowest label for cistrans).
The canonical code is the greatest code (elements are compared, the first difference determines order). The complete canonical code can also be seen as layers. The first layer is the FROM code together with the CLOSURE code. Together these codes describe the graph without node (atom) or edge (bond) properties. The ATOM-TYPES code is the layer adding node properties. The optional ISOTOPES code also encodes node properties. Edge properties are specified in the BOND-TYPES code. The previous codes together are enough to specify the constitution of a molecule. Additional layers such as the STEREO code could be ignored for certain use cases. For example, if two molecules have the same canonical code up to the STEREO part, they are stereoisomers.
Disconnected fragments are handled individually and the resulting canonical codes are sorted. The final canonical code for the whole molecule, are the sorted canonical codes joined together.
This section documents optimizations to reduce the number of states which affect the final result. The aim of these optimizations is to avoid the n! permutations in step 2 of the labeling algorithm. Some of the optimizations affect the final result and should be implemented when trying to reproduce the same canonical code.
If the current atom is not in a ring, it follows that the bonds to all it's neighbors are not ring bonds. It also follows that there is no path between two neighbor atoms with the same symmetry class without going through the current atom. Therefore, these ligands (i.e. the fragments that would be created by deleting the current atom) can be labeled individually. For example, for an atom with 4 neighbors with the same symmetry class, the number of permutations to label is reduced from 4! * 4 = 96 to 4. The times 4 is because after generating the 24 permutations, the ligands are not labeled yet.
- This optimization affects the final result.
If the bond(s) to the neighbor atom(s) with the same symmetry classes is not a ring bond, optimization 1 could be used. This is the more general version of optimization 1.
- This optimization affects the final result but is not implemented.
If the current atom is a stereo center, only permutations with the highest descriptor produce a greatest canonical code. This does not change the final result and can be implemented without changing canonical smiles etc.
In essence, the canonical coding algorithm is very similar to the algorithm from the well-known nauty package [1, 2]. However, the optimizations mentioned above are far easier to implement, document and result in acceptable performance for most structures. This enables anyone to reproduce OpenBabel canonical smiles without the absolute requirement to implement the complex optimizations from nauty. Since the optimizations from the nauty paper do not have any affect on the final result, these can be implemented as required.
The OpenBabel algorithm uses the molecule graph as the main object while searching for a canonical code. Nauty starts from a partition of the set of nodes which is refined. This is not the essential part though. The similarity between the algorithms is that both algorithms share the concept of a search tree. A completed canonical candidate code is a leaf node in this search tree. If a second labeling is found with the same canonical code, there exists a permutation mapping the first labels to the second. This permutation is an automorphism that takes all attributes of the canonical code into account. The found automorphism is not only an automorphism of the molecule but also an isomorphism of the search tree. These accidentally discovered automorphisms can be used to reduce the number of states to consider in various ways.
 Brendan D. McKay, Backtrack programming and isomorphism rejection on ordered subsets, ARS Combinatoria, Vol. 5, 1979, pp. 65-99 http://cs.anu.edu.au/~bdm/papers/backtrack1978.pdf  Brendan D. McKay, Practical graph isomorphism, Congressus Numerantium, Vol. 30, 1981, pp. 45-87 http://cs.anu.edu.au/~bdm/papers/pgi.pdf