Radicals and SMILES extensions

From Open Babel
Revision as of 01:12, 19 September 2007 by Chrismorl (Talk | contribs) (Remove C[O*])

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

The need for radicals and implicit hydrogen to coexist

Hydrogen deficient molecules, radicals, carbenes, etc., are not well catered for by chemical software aimed at pharmacuticals. But radicals are important reaction intermediates in living systems as well as many other fields, such as polymers, paints, oils, combustion and atmospheric chemistry. The examples given here are small molecules, relevant to the last two applications.

Chemistry software to handle radicals is complicated by the common use of implicit hydrogen when describing molecules. How is the program to know when you type "O" whether you mean an oxygen atom or water? This ambiguity leads some to say that hydrogens should always be explicit in any chemical description. But this is not the way that most chemists work. A straight paraffinic chain from which a hydrogen had been abstracted might commonly be represented by something like:

3-pentyl radical

This uses implicit hydrogens and an explicit radical centre. But sometimes the hydrogens are explicit and the radical centre implicit, as in CH3 - the methyl radical.

How OpenBabel does it

OpenBabel accepts molecules with explicit or implicit hydrogens and can convert between the two. It will also handle radicals (and other hydrogen-deficient species) with implicit hydrogen by using internally a property of an atom, _spinmultiplicity, modelled on the RAD property in MDL MOL files and also used in CML. This can be regarded in the present context as a measure of the hydrogen deficiency of an atom. Its value is:

0 for normal atoms,
2 for radical (missing one hydrogen) and
1 or 3 for carbenes and nitrenes (missing two hydrogens).

It happens that for some doubly deficient species, like carbene CH2 and oxygen atoms, the singlet and triplet species are fairly close in energy and both may be significant in certain applications such as combustion, atmospheric or preparative organic chemistry, so it is convenient that they can be described separately. There are of course an infinity of other electronic configurations of molecules but OpenBabel has no special descriptors for them. However, even more hydrogen-deficient atoms are indicated by the highest possible value of spinmultiplicity (C atom has spin multiplicity of 5). (This extends MDL's RAD property which has a maximum value of 3.)

If the spin multiplicity of an atom is not input explicitly, it is set (in OBMol::AssignSpinMultiplicity()) when the input format is mol, smi, cml or therm. This routine is called after all the atoms and bonds of the molecule are known. It detects hydrogen deficiency in an atom and assigns spin multiplicity appropriately. But because hydrogen may be implicit it only does this for atoms which have at least one explicit hydrogen or on atoms which have had ForceNoH()called for them - which is effectively zero explicitly hydrogens. The latter is used, for instance, when SMILES inputs [O] to ensure that it is seen as an oxygen atom (spin multiplicity=3) rather than water. Otherwise, atoms with no explicit hydrogen are assumed to have a spin multiplicity of 0, i.e with full complement of implicit hydrogens.

In deciding which atoms should be have spin multiplicity assigned, hydrogen atoms which have an isotope specification (D,T or even 1H) do not count. So SMILES N[2H] is NH2D (spin multiplicity left at 0, so with a full content of implicit hydrogens), whereas N[H] is NH (spinmultiplicity=3). A deuterated radical like NHD is represented by [NH][2H].

In radicals either the hydrogen or the spin multiplicity can be implicit

Once the spin multiplicity has been set on an atom, the hydrogens can be implicit even if it is a radical. For instance, the following mol file, with explicit hydrogens, is one way of representing the ethyl radical:

ethyl radical
Has explicit hydrogen and implicit spin multiplicity
  7  6  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0
    0.0000    0.0000    0.0000 H   0  0  0  0  0
  1  2  1  0  0  0
  1  3  1  0  0  0
  1  4  1  0  0  0
  1  5  1  0  0  0
  2  6  1  0  0  0
  2  7  1  0  0  0

When read by OpenBabel the spinmultiplicity is set to 2 on the C atom 2. If the hydrogens are made implicit, perhaps by the -d option, and the molecule output again, an alternative representation is produced:

ethyl radical
Has explicit spin multiplicity and implicit hydrogen
  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0
  1  2  1  0  0  0
M  RAD  1   2   2

SMILES extensions for radicals

Although radical structures can be represent in SMILES by specifying the hydrogens explicity, e.g. [CH3] is the methyl radical, some chemists have apparently felt the need to devise non-standard extensions which represent the radical centre explicitly. OpenBabel will recognize C[O.] as well as C[O] as the methoxy radical CH3O, during input, but the non-standard form is not supported in output.

By default, radical centres are output in explict hydrogen form, e.g. C[CH2] for the ethyl radical. All the atoms will be in explict H form, i.e. [CH3][CH2], if AddHydrogens() or the -h option has been specified. The output is always standard SMILES, although other programs may not interpret radicals correctly.

OpenBabel supports another SMILES extension for both input and output: the use of lower case atomic symbols to represent radical centres. (This is supported on the ACCORD Chemistry Control and maybe elewhere.) So the ethyl radical is Cc and the methoxy radical is Co This form is input transparently and can be output by using the -xr option "radicals lower case". It is a useful shorthand in writing radicals, and in many cases is easier to read since the emphasis is on the radical centre rather than the number of hydrogens which is less chemically significant.

In addition, this extension interprets multiple lower case c without ring closure as a conjugated carbon chain, so that cccc is input as 1,3 butadiene. Lycopene (the red in tomatoes) is Cc(C)cCCc(C)cccc(C)cccc(C)ccccc(C)cccc(C)cccc(C)CCcc(C)C (without the stereo chemical specifications). This conjugated chain form is not used on output - except in the standard SMILES aromatic form, c1ccccc1 benzene.

It is interesting that the lower case extension actually improve the chemical representation in a few cases. The allyl radical C3H5 would be conventionally [CH2]=[CH][CH2] (in its explict H form), but could be represented as ccc with the extended syntax. The latter more accurately represents the symmetry of the molecule caused by delocalisation.

This extension is not as robust or carefully considered as standard SMILES and should be used with restraint. A structure which used c as a radical centre close to aromatic carbons can be confusing to read, and OpenBabel's SMILES parser can also be confused. It recognizes c1ccccc1c as the benzyl radical, but it doesn't like c1cc(c)ccc1. Radical centres should not be involved in ring closure: for cyclohexyl radical C1cCCCC1 is ok, c1CCCCC1 is not.