Difference between revisions of "Propsal for new stereochemistry implementation"

From Open Babel
m (Reverted edit of Lic4tDelbo, changed back to last version by Chrismorl)
Line 1: Line 1:
This page isn't ready yet, I've just put it up to edit and format it.
+
These are just public notes and ideas for v3.0 (should it ever happen) and likely to change. Revised [[User:Chrismorl|Chrismorl]] 08:08, 13 September 2008 (PDT)
  
 
The current handling of stereochemistry in OpenBabel is a bit disorganized, and there is a lack of clear documentation. Many of the bugs that are arising in v2.0 arise from the stereochemistry. It is not very clear when and how the overlapping 3D, 2D and 0D representations are used and on how to treat "either" or "undefined" stereochemistry.   
 
Line 7: Line 7:
 
==How stereochemical information would be handled==
 
  
In the same way that OpenBabel allows molecules to be represented with either implicit or explicit hydrogen, stereochemistry should also be handled in more than one way, reflecting the information available and the convenience for the chemist. It is necessary to clearly define when each of these representation is used and how and when they are inter-converted. (The following is written with "is" rather than "could" to avoid rewriting it later, but nothing is yet definite.)
+
Molecules can have 0D, 2D or 3D stereochemistry.
  
===Tetrahedral stereochemistry===
+
For atoms
 
+
* 3D molecules in x,y,z coordinates.
Stereochemical information is carried in 4 different ways, depending on what information is available on input or is needed for processing or output. There is not necessarily any conversion between the types.
+
* 2D molecules in x,y coordinates + hash/wedge bonds.
 
+
* 3D molecules in x,y,z coordinates
+
* 2D molecules in x,y coordinates + hash/wedge bonds
+
 
* 2D molecules (with x,y coordinates) in atom parities.
* 0D molecules in atom parities.
  
Making the 3D type from the other types is conformer generation;
+
For bonds
making the 2D type with wedge/hash bonds is layout. Both are non-trivial
+
* 2D,3D molecules in x,y(,z) coordinates
and are not featured in OpenBabel yet. A subset of 0D to 2D conversion is assigning
+
* 0D molecules in a bond parities.
hash/wedge to bonds and this could be part of core OpenBabel. (Although
+
maybe there may be hidden depths of difficulty to this apparently
+
easy process.)
+
  
The conversion of 3D and 2D to 0D (parities) occurs only when a 0D property is requested
+
The 2D representation is all about display. There may be several different representations each of which may be be helpful to human understanding in certain circumstances, so there is no need to provide a "definitive" 2D representation. The form present when a molecule was input would not normally be changed.
(lazy evaluation). 0D data for the whole molecule is usually generated
+
all together and stored in the OBMol object. With very large molecules,
+
there may be a case for generating the 0D stereo data only for a
+
specified atom, when it would not be stored in the OBMol.
+
  
Any atom parity in a 3D molecule would be ignored.
+
0D representation, on the other hand, has the information reduced to its essentials, and would be used, for instance, to determine molecular uniqueness, so that it needs to be in a non-arbitrary form. If the atoms were put into canonical order, the 0D stereochemical representation should always be the same.
In a 2D molecule, mixing wedge/hash and parity in the same molecule
+
is allowed, with the wedge/hash information having precedence. (Because it is more likely to have been approved by a human.)
+
  
===Cis-Trans Stereochemistry===
+
If there was a "Perception" event for a molecule, in which the implicit hydrogen, aromaticity etc., was sorted out, then it would be reasonable to also generate 0D stereo information from any 2D or 3D information present.
  
Stereochemistry reflecting the configuration around a bond with
+
Generating 2D and 3D representations from the 0D one is moving from "definitive" information to something with a range of possibilities, and would be done only on demand.
restricted rotation (mainly cis/trans) is also represented in more than
+
one way.
+
  
* 2D,3D molecules in x,y(,z) coordinates
 
* 0D molecules in a bond parities.
 
 
Unlike the current situation, the bond parity would be attached to the
 
bond with restricted rotation, not the adjacent ones. (SMILES uses up/down on adjacent bond which leads to over-complicated processing in molecules with conjugated double bonds - local analysis is not good enough.)
 
  
0D parameters are evaluated only when needed (lazy evalution) and this is done for the whole
 
molecule and stored in the OBMol. There could also be a local, non-stored evaluation.
 
  
 
==Details of data representation in OBMol==
Line 56: Line 35:
 
stereochemistry of the begin atom of the bond - which is taken to be
the pointed end.
Flags 0,0 are ordinary (not wedge, not hash)bonds andare the default.
+
Flags 0,0 are ordinary (not wedge, not hash)bonds and are the default.
 
1,1 represents a bond which is "either" - could be wedge or hash
(but not ordinary).

===0D tetrahedral stereo===
Each OBAtom would contain a parity flag and a vector of atom indices
+
Each OBAtom would contain a stereo-chemical type number. (Tetrahedral, by far the most common, would be 0.) Each atom would have  a small bitset and each type would define what the bits meant, which would be based on the path followed when you traversed the neighbouring atoms in index order. Any implicit hydrogen would be last. This is the convention used in MDL files. Making the hydrogen explicit would not change the order. Similarly 'missing' atom in sulphones or other similar structures would also be regarded as  having a high index.  
of atoms. These atoms would usually be attached to the chiral centre and  
+
there would usually be four members, although but there could be 3 if
+
there were an implict hydrogen and for some sulphones, or >5 for some
+
organometallic compounds. The parity flag is 1(true) if, when view
+
from the from the first vector member, the subsequent atoms are in
+
clockwise order.(In future there may some similar convention for other
+
types of stereochemistry).
+
  
The vector could be in an OBGenericData structure like the current
+
So for the tetrahedral type, a 'parity' bit would be 1 when looking down from the first atom the path through the other three was clockwise. A second 'chiral' bit would be 1 if the chirality was defined.  
OBChiralData, but I would suggest that it would be better directly in
+
chiral parity
OBAtom, reflecting its structural significance. OBAtom would contain
+
  1      0    anticlockwise parity
a member variable which was a pointer to a std::vector<unsigned int>.
+
  1      1    clockwise parity
The vector would be stored on the heap and deleted in the OBAtom's
+
  0      0    undefined
destructor. A NULL pointer would indicate a non-chiral atom. An empty
+
  0      1    either
vector would represent an "either" parity.
+
The 'either' type seems to be required for some purposes.
  
It would be possible to do without the parity flag, but it simplifies
+
===Other types of atom stereochemistry===
some OBMol operations.
+
  
In OBMol there would be Has0DChirality flag which would be set if
+
Allene and other structures with an odd number of C atoms connected by double bonds is treated analogously to tetrahedral stereochemistry by SMILES and OB should do the same: label the central atom as if the other doubly bonded carbons were not there.
0D info had been input or generated from 2D or 3D data. Any such
+
generation would only done if it were unset (lazy evaluation).
+
There would also be a flag which indicated, as a result of this, that
+
the molcule contained chiral atoms.
+
  
 +
For square planar two parity bits would be needed for a U, W or Z shaped path (see [http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html#RTFToC25]. There could be one 'chiral' bit and possibly more if more subtle degrees of uncertainty were required (unlikely).
 +
 
If there was an implicit hydrogen on a chiral atom the vector would
contain 3 atom indices. It would be assumed that the implict hydogen
Line 92: Line 61:
  
 
===0D cis/trans, allene, cummulene, axial===
The OBBond would contain a pointer to a std::vector<unsigned int>,
+
 
 +
Unlike the current situation, the bond stereo information would be attached to the
 +
bond with restricted rotation, not the adjacent ones. (SMILES uses up/down on adjacent bond which leads to over-complicated processing in molecules with conjugated double bonds - local analysis is not good enough.)
 +
 
 +
The OBBond would contain a pointer to a std::vector of atom pointers,
 
which would be NULL for bonds without stereo information. The vector
would contain atom indices X, A, B, Y
X
+
<pre>
\
+
X
 +
  \
 
  A==B
      \
        X
 +
</pre>
 
A parity flag would be 1(true) for the structure as shown(trans) and
0 for cis. The atoms could be any with chemical significance, for
instance A and B could be the ends of a set of adjacent double bonds
in a cumulene, or X,Y could be distant atoms in some axial chiralities.
It would be possible for the A and B indices to be atoms other than
+
It would be possible for the A and B atoms to be other than
 
the atoms of the bond in order to represent some other types of
stereochemistry. Any implementation needs to recognize this possibility.
Any bond, not just double bonds, can have this information.

Revision as of 08:08, 13 September 2008

These are just public notes and ideas for v3.0 (should it ever happen) and are likely to change. Revised Chrismorl 08:08, 13 September 2008 (PDT)

The current handling of stereochemistry in OpenBabel is somewhat disorganized and lacks clear documentation. Many of the bugs arising in v2.0 stem from the stereochemistry code. It is not clear when and how the overlapping 3D, 2D and 0D representations are used, or how "either" and "undefined" stereochemistry should be treated.

There is a case for completely revising the implementation. This would be done by first writing a specification, which would be definitive. After the revision, all stereochemistry handling would conform to this specification, unlike at present, where the API has features that are obsolete or undocumented. It would also be an opportunity to design the internal structures so that other kinds of stereochemistry, beyond cis/trans and tetrahedral, would be representable in OpenBabel.

How stereochemical information would be handled

Molecules can have 0D, 2D or 3D stereochemistry.

For atoms

  • 3D molecules in x,y,z coordinates.
  • 2D molecules in x,y coordinates + hash/wedge bonds.
  • 2D molecules (with x,y coordinates) in atom parities.
  • 0D molecules in atom parities.

For bonds

  • 2D,3D molecules in x,y(,z) coordinates.
  • 0D molecules in bond parities.

The 2D representation is all about display. There may be several different representations, each of which may be helpful to human understanding in certain circumstances, so there is no need to provide a "definitive" 2D representation. The form present when a molecule was input would not normally be changed.

The 0D representation, on the other hand, has the information reduced to its essentials and would be used, for instance, to determine molecular uniqueness, so it needs to be in a non-arbitrary form. If the atoms were put into canonical order, the 0D stereochemical representation should always be the same.
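Because the 0D parity is defined relative to the index order of the neighbouring atoms, renumbering the atoms (for example into canonical order) may require the parity bit to be flipped. The following is only a minimal sketch of that bookkeeping, assuming four explicit neighbours and no implicit hydrogen; the function names are hypothetical and are not part of the OpenBabel API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if `seq` is an odd permutation of its sorted order
// (i.e. it contains an odd number of inversions).
inline bool isOddPermutation(const std::vector<unsigned int>& seq) {
    std::size_t inversions = 0;
    for (std::size_t i = 0; i < seq.size(); ++i)
        for (std::size_t j = i + 1; j < seq.size(); ++j)
            if (seq[i] > seq[j]) ++inversions;
    return inversions % 2 == 1;
}

// `neighbours` holds the OLD indices of the attached atoms in ascending order.
// `newIndex[i]` is the new index of the atom whose old index was i.
// Returns the parity bit expressed relative to the NEW ascending index order.
inline bool parityAfterRenumbering(bool parity,
                                   const std::vector<unsigned int>& neighbours,
                                   const std::vector<unsigned int>& newIndex) {
    std::vector<unsigned int> renumbered;
    for (unsigned int n : neighbours) {
        assert(n < newIndex.size());
        renumbered.push_back(newIndex[n]);
    }
    // An odd reordering of the reference atoms inverts the sense of rotation.
    return isOddPermutation(renumbered) ? !parity : parity;
}
```

Swapping the indices of two neighbours flips the bit, while a cyclic renumbering of three neighbours leaves it unchanged, so the same molecule in canonical order always ends up with the same parity.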

If there were a "Perception" event for a molecule, in which the implicit hydrogens, aromaticity, etc. were sorted out, then it would be reasonable to also generate 0D stereo information from any 2D or 3D information present.
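As an illustration of generating 0D information from 3D data, the sketch below derives a tetrahedral parity from the coordinates of the four neighbours using the sign of a scalar triple product. It assumes the convention used later on this page (clockwise when looking down from the first neighbour); the types and names are hypothetical, and the tolerance is an arbitrary choice.

```cpp
#include <array>
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

inline Vec3 sub(const Vec3& p, const Vec3& q) {
    return {p.x - q.x, p.y - q.y, p.z - q.z};
}
inline Vec3 cross(const Vec3& p, const Vec3& q) {
    return {p.y * q.z - p.z * q.y, p.z * q.x - p.x * q.z, p.x * q.y - p.y * q.x};
}
inline double dot(const Vec3& p, const Vec3& q) {
    return p.x * q.x + p.y * q.y + p.z * q.z;
}

enum class Parity { Undefined, Anticlockwise, Clockwise };

// nbr[0..3] are the neighbour positions in reference (index) order.
// Seen from nbr[0], the path nbr[1] -> nbr[2] -> nbr[3] is clockwise
// exactly when the signed volume below is positive.
inline Parity parityFrom3D(const std::array<Vec3, 4>& nbr, double eps = 1e-6) {
    assert(eps >= 0.0);
    const Vec3 u = sub(nbr[1], nbr[0]);
    const Vec3 v = sub(nbr[2], nbr[0]);
    const Vec3 w = sub(nbr[3], nbr[0]);
    const double volume = dot(u, cross(v, w));
    if (std::fabs(volume) <= eps) return Parity::Undefined;  // (nearly) coplanar
    return volume > 0.0 ? Parity::Clockwise : Parity::Anticlockwise;
}
```

A coplanar arrangement gives no usable sign, which is one place where the "undefined" state would arise naturally.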

Generating 2D and 3D representations from the 0D one is moving from "definitive" information to something with a range of possibilities, and would be done only on demand.


Details of data representation in OBMol

2D tetrahedral stereo

Each bond has a wedge flag and a hash flag. These flags affect only the stereochemistry of the begin atom of the bond, which is taken to be the pointed end. Flags 0,0 denote an ordinary (not wedge, not hash) bond and are the default. Flags 1,1 represent a bond which is "either": it could be wedge or hash (but not ordinary).
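The four flag combinations can be decoded as in the following sketch. The struct and enum names are hypothetical (they are not existing OpenBabel identifiers), and the reading of 1,0 as wedge and 0,1 as hash is an assumption based on the flag names.

```cpp
// Hypothetical pair of per-bond display flags, as described in the text.
struct BondStereoFlags {
    bool wedge;
    bool hash;
};

enum class BondDisplay { Ordinary, Wedge, Hash, Either };

// 0,0 = ordinary (default); 1,0 = wedge; 0,1 = hash; 1,1 = either.
constexpr BondDisplay decode(BondStereoFlags f) {
    return (f.wedge && f.hash) ? BondDisplay::Either
         : f.wedge             ? BondDisplay::Wedge
         : f.hash              ? BondDisplay::Hash
                               : BondDisplay::Ordinary;
}
```

Using two independent flags, rather than a single enumerated value, is what makes the "either" state representable without an extra field.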

0D tetrahedral stereo

Each OBAtom would contain a stereochemical type number. (Tetrahedral, by far the most common, would be 0.) Each atom would have a small bitset, and each type would define what the bits meant, based on the path followed when the neighbouring atoms are traversed in index order. Any implicit hydrogen would be last. This is the convention used in MDL files. Making the hydrogen explicit would not change the order. Similarly, a 'missing' atom in sulphones or other similar structures would also be regarded as having a high index.
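The traversal order described above could be built as in this minimal sketch, which sorts the explicit neighbours by index and appends a sentinel for each implicit hydrogen or 'missing' atom so that it sorts last. The sentinel value and the function name are assumptions made for the example only.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical sentinel standing for an implicit hydrogen or a 'missing'
// atom; being the largest possible index, it always comes last.
constexpr unsigned int kImplicitRef = std::numeric_limits<unsigned int>::max();

// Returns the neighbour indices in ascending order, followed by one
// sentinel per implicit or missing neighbour.
inline std::vector<unsigned int> referenceOrder(std::vector<unsigned int> neighbours,
                                                unsigned int implicitCount) {
    // The sentinel must not collide with a real atom index.
    assert(std::find(neighbours.begin(), neighbours.end(), kImplicitRef) ==
           neighbours.end());
    std::sort(neighbours.begin(), neighbours.end());
    neighbours.insert(neighbours.end(), implicitCount, kImplicitRef);
    return neighbours;
}
```

If the hydrogen is later made explicit and receives an index higher than the other neighbours, its position in this order is unchanged, which is the property the text relies on.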

So for the tetrahedral type, a 'parity' bit would be 1 when, looking down from the first atom, the path through the other three is clockwise. A second 'chiral' bit would be 1 if the chirality was defined.

 chiral parity  meaning
  1      0      anticlockwise parity
  1      1      clockwise parity
  0      0      undefined
  0      1      either

The 'either' type seems to be required for some purposes.
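The two-bit table above could be decoded as follows. This is a sketch under the assumption that the type number and the two bits are stored side by side in a small record; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

enum class TetraState { Undefined, Either, Clockwise, Anticlockwise };

// Hypothetical per-atom stereo record.
struct AtomStereoBits {
    std::uint8_t type;  // 0 = tetrahedral in this proposal
    bool chiral;        // 1 if the chirality is defined
    bool parity;        // 1 if clockwise when seen from the first neighbour
};

// Interpret the bits of a tetrahedral centre according to the table.
inline TetraState tetraState(const AtomStereoBits& b) {
    assert(b.type == 0 && "only the tetrahedral type is decoded here");
    if (b.chiral)
        return b.parity ? TetraState::Clockwise : TetraState::Anticlockwise;
    return b.parity ? TetraState::Either : TetraState::Undefined;
}
```

Other stereo types would supply their own decoding of the same bitset, selected by the type number.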

Other types of atom stereochemistry

Allenes and other structures with an odd number of C atoms connected by double bonds are treated analogously to tetrahedral stereochemistry by SMILES, and OB should do the same: label the central atom as if the other doubly bonded carbons were not there.

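For square planar stereochemistry, two parity bits would be needed to describe a U, W or Z shaped path (see the Daylight SMILES theory manual, http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html#RTFToC25). There could be one 'chiral' bit, and possibly more if more subtle degrees of uncertainty were required (unlikely).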

If there were an implicit hydrogen on a chiral atom, the vector would contain 3 atom indices. It would be assumed that the implicit hydrogen was last. (This is the convention used in MDL files.)

0D cis/trans, allene, cumulene, axial

Unlike the current situation, the bond stereo information would be attached to the bond with restricted rotation, not to the adjacent ones. (SMILES uses up/down marks on the adjacent bonds, which leads to over-complicated processing in molecules with conjugated double bonds, where local analysis is not good enough.)

The OBBond would contain a pointer to a std::vector of atom pointers, which would be NULL for bonds without stereo information. The vector would contain the atoms X, A, B, Y, in that order:

	 X
	  \
	   A==B
	       \
	        Y

A parity flag would be 1 (true) for the structure as shown (trans) and 0 for cis. The atoms could be any with chemical significance; for instance, A and B could be the ends of a set of adjacent double bonds in a cumulene, or X and Y could be distant atoms in some axial chiralities. A and B could also be atoms other than those of the bond itself, in order to represent some other types of stereochemistry, and any implementation needs to recognize this possibility. Any bond, not just a double bond, can carry this information.
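For the simple planar case, the bond parity could be derived from 2D or 3D coordinates as in the sketch below. It compares the components of A→X and B→Y that are perpendicular to the A-B axis: if they point in opposite directions the arrangement is trans. The names, the enum and the tolerance are assumptions for illustration, and the non-planar cases mentioned above (some cumulenes and axial chiralities) would need a different test.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

inline Vec3 sub(const Vec3& p, const Vec3& q) {
    return {p.x - q.x, p.y - q.y, p.z - q.z};
}
inline Vec3 scale(const Vec3& v, double s) { return {v.x * s, v.y * s, v.z * s}; }
inline double dot(const Vec3& p, const Vec3& q) {
    return p.x * q.x + p.y * q.y + p.z * q.z;
}

enum class BondParity { Undefined, Cis, Trans };

// Component of v perpendicular to axis.
inline Vec3 perpendicular(const Vec3& v, const Vec3& axis) {
    const double len2 = dot(axis, axis);
    assert(len2 > 0.0 && "A and B must not coincide");
    return sub(v, scale(axis, dot(v, axis) / len2));
}

// X, A, B, Y as in the diagram; for 2D input set z to 0.
inline BondParity bondParityFromCoords(const Vec3& X, const Vec3& A,
                                       const Vec3& B, const Vec3& Y,
                                       double eps = 1e-6) {
    const Vec3 axis = sub(B, A);
    const double s = dot(perpendicular(sub(X, A), axis),
                         perpendicular(sub(Y, B), axis));
    if (std::fabs(s) <= eps) return BondParity::Undefined;  // collinear or perpendicular
    return s < 0.0 ? BondParity::Trans : BondParity::Cis;
}
```

In the proposal's terms, Trans corresponds to a parity flag of 1 and Cis to 0, with Undefined covering input that carries no usable geometry.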