Proposal for a new stereochemistry implementation
This page isn't ready yet; I've just put it up to edit and format it.
The current handling of stereochemistry in OpenBabel is somewhat disorganized and lacks clear documentation. Many of the bugs arising in v2.0 stem from the stereochemistry code. It is not clear when and how the overlapping 3D, 2D and 0D representations are used, or how "either" or "undefined" stereochemistry should be treated.
There is a case for completely revising the implementation. This would be done by first writing a specification, which would be definitive. After the revision, all stereochemistry handling would conform to this specification, unlike at present, where the API has features that are obsolete or undocumented. It would also be an opportunity to design the internal structures so that other kinds of stereochemistry - beyond cis/trans and tetrahedral - would be representable in OpenBabel.
How stereochemical information would be handled
In the same way that OpenBabel allows molecules to be represented with either implicit or explicit hydrogens, stereochemistry should also be handled in more than one way, reflecting the information available and the convenience for the chemist. It is necessary to define clearly when each of these representations is used and how and when they are interconverted. (The following is written with "is" rather than "could" to avoid rewriting it later, but nothing is yet definite.)
Stereochemical information is carried in 4 different ways, depending on what information is available on input or is needed for processing or output. There is not necessarily any conversion between the types.
- 3D molecules in x,y,z coordinates
- 2D molecules in x,y coordinates + hash/wedge bonds
- 2D molecules (with x,y coordinates) in atom parities.
- 0D molecules in atom parities.
Making the 3D type from the other types is conformer generation; making the 2D type with wedge/hash bonds is layout. Both are non-trivial and are not yet available in OpenBabel. Assigning hash/wedge to bonds is a subset of 0D to 2D conversion and could be part of core OpenBabel, although this apparently easy process may have hidden depths of difficulty.
The conversion of 3D and 2D to 0D (parities) occurs only when a 0D property is requested (lazy evaluation). 0D data for the whole molecule is usually generated all together and stored in the OBMol object. With very large molecules, there may be a case for generating the 0D stereo data only for a specified atom, in which case it would not be stored in the OBMol.
Any atom parity in a 3D molecule would be ignored. In a 2D molecule, mixing wedge/hash and parity in the same molecule is allowed, with the wedge/hash information taking precedence because it is more likely to have been approved by a human.
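The precedence rules above can be summarised as a small decision function. This is only a sketch: StereoSource, CentreInfo and ChooseStereoSource are hypothetical names invented for illustration and are not part of the OpenBabel API.

```cpp
#include <cassert>

// Hypothetical types for illustration; none of these exist in OpenBabel.
enum class StereoSource { None, Coordinates3D, WedgeHash, AtomParity };

struct CentreInfo {
    int  dimension;        // dimensionality of the molecule: 0, 2 or 3
    bool hasWedgeOrHash;   // a wedge/hash bond begins at this atom
    bool hasStoredParity;  // a 0D parity was supplied on input
};

// Decide which representation defines the stereochemistry of one centre.
inline StereoSource ChooseStereoSource(const CentreInfo& c) {
    assert(c.dimension == 0 || c.dimension == 2 || c.dimension == 3);
    // 3D: coordinates always win; any stored parity is ignored.
    if (c.dimension == 3) return StereoSource::Coordinates3D;
    // 2D: wedge/hash takes precedence over a stored parity.
    if (c.dimension == 2 && c.hasWedgeOrHash) return StereoSource::WedgeHash;
    // 0D, or 2D without wedge/hash: fall back to the stored parity, if any.
    if (c.hasStoredParity) return StereoSource::AtomParity;
    return StereoSource::None;
}
```

For a 2D centre carrying both a wedge bond and a parity, the function returns WedgeHash, matching the rule that the drawn bond is the more trusted source.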
Stereochemistry reflecting the configuration around a bond with restricted rotation (mainly cis/trans) is also represented in more than one way.
- 2D, 3D molecules in x,y(,z) coordinates
- 0D molecules in bond parities.
Unlike the current situation, the bond parity would be attached to the bond with restricted rotation, not to the adjacent ones. (SMILES uses up/down marks on the adjacent bonds, which leads to over-complicated processing in molecules with conjugated double bonds - local analysis is not good enough.)
0D parameters are evaluated only when needed (lazy evaluation); this is done for the whole molecule and the result is stored in the OBMol. There could also be a local, non-stored evaluation.
Details of data representation in OBMol
2D tetrahedral stereo
Each bond has a wedge flag and a hash flag. These flags affect only the stereochemistry of the begin atom of the bond, which is taken to be the pointed end. Flags 0,0 denote ordinary (not wedge, not hash) bonds and are the default. Flags 1,1 represent a bond which is "either": it could be wedge or hash (but not ordinary).
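The four flag combinations can be decoded as in the sketch below. The names BondStereo2D, ClassifyBond and AffectsAtom are hypothetical and chosen only for this illustration; the reading of 1,0 as wedge and 0,1 as hash is assumed from the flag names.

```cpp
#include <cassert>

// Hypothetical enum for illustration; not an OpenBabel type.
enum class BondStereo2D { Ordinary, Wedge, Hash, Either };

// Decode the two per-bond flags described in the text.
inline BondStereo2D ClassifyBond(bool wedgeFlag, bool hashFlag) {
    if (wedgeFlag && hashFlag) return BondStereo2D::Either;  // 1,1
    if (wedgeFlag) return BondStereo2D::Wedge;               // 1,0 (assumed)
    if (hashFlag)  return BondStereo2D::Hash;                // 0,1 (assumed)
    return BondStereo2D::Ordinary;                           // 0,0 default
}

// A stereo bond only says something about its begin atom (the pointed end).
inline bool AffectsAtom(unsigned int beginAtomIdx, unsigned int atomIdx,
                        BondStereo2D stereo) {
    return stereo != BondStereo2D::Ordinary && atomIdx == beginAtomIdx;
}
```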
0D tetrahedral stereo
Each OBAtom would contain a parity flag and a vector of atom indices. These atoms would usually be attached to the chiral centre and there would usually be four members, although there could be 3 if there were an implicit hydrogen and for some sulphoxides, or more than 5 for some organometallic compounds. The parity flag is 1 (true) if, when viewed from the first vector member, the subsequent atoms are in clockwise order. (In future there may be some similar convention for other types of stereochemistry.)
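As an illustration of the clockwise convention, the parity of a four-neighbour centre can be derived from 3D coordinates with a signed volume. The sketch below assumes a right-handed coordinate system; Vec3, AtomParity, ParityFrom3D and the tolerance value are hypothetical and not taken from OpenBabel.

```cpp
#include <cassert>
#include <cmath>

// Minimal vector helpers; hypothetical, for illustration only.
struct Vec3 { double x, y, z; };

inline Vec3 Sub(const Vec3& a, const Vec3& b) {
    return {a.x - b.x, a.y - b.y, a.z - b.z};
}
inline Vec3 Cross(const Vec3& a, const Vec3& b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
inline double Dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

enum class AtomParity { Anticlockwise = 0, Clockwise = 1, Undefined = 2 };

// 'first' is the position of the first vector member (the viewpoint);
// a, b, c are the positions of the subsequent members, in stored order.
inline AtomParity ParityFrom3D(const Vec3& first, const Vec3& a,
                               const Vec3& b, const Vec3& c,
                               double tolerance = 1e-6) {
    // Signed volume of the tetrahedron spanned from the viewpoint.
    const double volume =
        Dot(Sub(a, first), Cross(Sub(b, first), Sub(c, first)));
    if (std::fabs(volume) < tolerance) return AtomParity::Undefined;  // planar
    // In a right-handed system a positive volume means a, b, c appear
    // clockwise when seen from 'first'.
    return volume > 0.0 ? AtomParity::Clockwise : AtomParity::Anticlockwise;
}
```

The position of the central atom is not needed: the sign is fixed by the four neighbours alone. A near-zero volume is reported as undefined rather than guessed.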
The vector could be in an OBGenericData structure like the current OBChiralData, but I would suggest that it would be better directly in OBAtom, reflecting its structural significance. OBAtom would contain a member variable which was a pointer to a std::vector<unsigned int>. The vector would be stored on the heap and deleted in the OBAtom's destructor. A NULL pointer would indicate a non-chiral atom. An empty vector would represent an "either" parity.
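The three states described here (NULL pointer, empty vector, populated vector) could look roughly like the following. SketchAtom is a hypothetical stand-in for OBAtom, and std::unique_ptr is substituted for the raw pointer plus destructor mentioned in the text; it gives the same heap storage and automatic deletion.

```cpp
#include <cassert>
#include <memory>
#include <vector>

enum class CentreState { NotChiral, Either, Specified };

// Hypothetical stand-in for OBAtom, showing only the stereo members.
class SketchAtom {
public:
    // NULL pointer: the atom is not a stereocentre.
    void ClearStereo() { refs_.reset(); }

    // Empty vector: the centre is chiral but its parity is "either".
    void SetEither() { refs_.reset(new std::vector<unsigned int>()); }

    // Populated vector plus parity flag: fully specified centre.
    void SetStereo(const std::vector<unsigned int>& refs, bool clockwise) {
        assert(!refs.empty());
        refs_.reset(new std::vector<unsigned int>(refs));
        parity_ = clockwise;
    }

    CentreState State() const {
        if (!refs_) return CentreState::NotChiral;
        return refs_->empty() ? CentreState::Either : CentreState::Specified;
    }

    bool Parity() const { return parity_; }
    const std::vector<unsigned int>* Refs() const { return refs_.get(); }

private:
    std::unique_ptr<std::vector<unsigned int>> refs_;  // freed automatically
    bool parity_ = false;
};
```

One consequence of this choice is that the class is movable but not copyable by default; a real OBAtom would need an explicit deep copy of the vector.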
It would be possible to do without the parity flag, but it simplifies some OBMol operations.
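One way to see why the flag is not strictly required: swapping any two entries in the vector inverts the clockwise sense, so the order alone could carry the parity. The same fact lets the stored parity be re-expressed for any other ordering of the same atoms. The sketch below uses hypothetical function names and counts inversions to find whether the reordering is odd.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// True if 'to' is an odd permutation of 'from' (both hold the same indices).
inline bool IsOddPermutation(const std::vector<unsigned int>& from,
                             const std::vector<unsigned int>& to) {
    assert(from.size() == to.size());
    // Position in 'from' of each entry of 'to'.
    std::vector<std::size_t> position;
    position.reserve(to.size());
    for (unsigned int idx : to) {
        auto it = std::find(from.begin(), from.end(), idx);
        assert(it != from.end());
        position.push_back(static_cast<std::size_t>(it - from.begin()));
    }
    // Count inversions; an odd count means an odd permutation.
    std::size_t inversions = 0;
    for (std::size_t i = 0; i < position.size(); ++i)
        for (std::size_t j = i + 1; j < position.size(); ++j)
            if (position[i] > position[j]) ++inversions;
    return inversions % 2 == 1;
}

// Parity flag that describes the same centre with the atoms in 'wanted' order.
inline bool ParityInOrder(const std::vector<unsigned int>& stored,
                          bool storedParity,
                          const std::vector<unsigned int>& wanted) {
    return IsOddPermutation(stored, wanted) ? !storedParity : storedParity;
}
```

Keeping an explicit flag avoids having to reorder the vector whenever the parity of a centre is inverted.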
In OBMol there would be a Has0DChirality flag, which would be set if 0D info had been input or generated from 2D or 3D data. Any such generation would only be done if the flag were unset (lazy evaluation). There would also be a flag which indicated, as a result of this, that the molecule contained chiral atoms.
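The flag-guarded lazy evaluation might be sketched as follows. SketchMol is a hypothetical stand-in for OBMol; the perception step is passed in as a callback, and the per-atom encoding (0 for non-chiral, non-zero for chiral) is an assumption made only for this example.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for OBMol, showing only the lazy 0D stereo cache.
class SketchMol {
public:
    // Callback that derives per-atom 0D data from 2D or 3D information.
    using Perceiver = std::function<std::vector<int>()>;

    explicit SketchMol(Perceiver perceive) : perceive_(std::move(perceive)) {
        assert(perceive_ != nullptr);
    }

    // 0D data supplied directly on input: sets the flag, so nothing is
    // generated later.
    void SetParities(std::vector<int> parities) { Store(std::move(parities)); }

    // Generates the data for the whole molecule on first request only.
    const std::vector<int>& GetParities() {
        if (!has0DChirality_) Store(perceive_());
        return parities_;
    }

    bool Has0DChirality() const { return has0DChirality_; }

    // Second flag: known only once the 0D data exists.
    bool HasChiralAtoms() {
        GetParities();
        return hasChiralAtoms_;
    }

private:
    void Store(std::vector<int> parities) {
        parities_ = std::move(parities);
        has0DChirality_ = true;
        hasChiralAtoms_ = false;
        for (int p : parities_)
            if (p != 0) hasChiralAtoms_ = true;  // assumed: non-zero = chiral
    }

    Perceiver perceive_;
    std::vector<int> parities_;
    bool has0DChirality_ = false;
    bool hasChiralAtoms_ = false;
};
```

Repeated calls to GetParities run the perception callback at most once, and not at all if parities were supplied on input.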
If there were an implicit hydrogen on a chiral atom, the vector would contain 3 atom indices. It would be assumed that the implicit hydrogen was last. (This is the convention used in MDL files.)
0D cis/trans, allene, cumulene, axial
The OBBond would contain a pointer to a std::vector<unsigned int>, which would be NULL for bonds without stereo information. The vector would contain the atom indices X, A, B, Y:

    X
     \
      A==B
          \
           Y

A parity flag would be 1 (true) for the structure as shown (trans) and 0 for cis. The atoms could be any with chemical significance; for instance, A and B could be the ends of a set of adjacent double bonds in a cumulene, or X and Y could be distant atoms in some axial chiralities. It would be possible for the A and B indices to be atoms other than the atoms of the bond, in order to represent some other types of stereochemistry. Any implementation needs to recognize this possibility. Any bond, not just a double bond, can have this information.
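For the simple double-bond case, the bond parity can be derived from 2D coordinates by checking whether X and Y lie on opposite sides of the line through A and B. The sketch below uses hypothetical names (Vec2, BondParity, BondParityFrom2D) and an assumed tolerance; it does not cover cumulenes or axial cases.

```cpp
#include <cassert>
#include <cmath>

struct Vec2 { double x, y; };

enum class BondParity { Cis = 0, Trans = 1, Undefined = 2 };

// Signed area test: positive if p lies to the left of the directed line
// a -> b, negative if to the right, zero if p is on the line.
inline double Side(const Vec2& a, const Vec2& b, const Vec2& p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// x is attached to a, y is attached to b; a and b form the stereo bond.
inline BondParity BondParityFrom2D(const Vec2& x, const Vec2& a,
                                   const Vec2& b, const Vec2& y,
                                   double tolerance = 1e-6) {
    const double sideX = Side(a, b, x);
    const double sideY = Side(a, b, y);
    // A substituent collinear with the bond gives no usable information.
    if (std::fabs(sideX) < tolerance || std::fabs(sideY) < tolerance)
        return BondParity::Undefined;
    // Opposite sides of the bond: trans (1); same side: cis (0).
    return sideX * sideY < 0.0 ? BondParity::Trans : BondParity::Cis;
}
```

With 3D coordinates the same decision would normally be made from the X-A-B-Y torsion angle rather than from a planar side test.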