obabel and babel - Convert, Filter and Manipulate Chemical Data

obabel and babel are cross-platform programs that interconvert many of the file formats used in molecular modeling, computational chemistry and related areas. They can also be used to filter molecules and to perform simple manipulation of chemical data.

Synopsis

  • obabel [-H <help-options>]
  • babel [-H <help-options>]
  • obabel [-i <input-ID>] infile [-o <output-ID>] [-O outfile] [OPTIONS]
  • babel [-i <input-ID>] infile [-o <output-ID>] [outfile] [OPTIONS]

obabel is recommended over babel (see Differences between babel and obabel).

Options

Information and help

  • obabel [-H <help-options>]
  • babel  [-H <help-options>]
-H Output usage information
-H <format-ID> Output formatting information and options for the format specified
-Hall Output formatting information and options for all formats
-L List plugin types (charges, descriptors, fingerprints, forcefields, formats, loaders and ops)
-L <plugin type>
 List plugins of this type. For example, obabel -L formats gives the list of file formats.
-L <plugin-ID> Details of a particular plugin (of any plugin type). For example, obabel -L cml gives details on the CML file format.
-V Output version number

Conversion options

  • obabel [-i <input-ID>] infile [-o <output-ID>] [-O outfile] [OPTIONS]
  • obabel -:"<SMILES string>"    [-o <output-ID>] [-O outfile] [OPTIONS]
  • babel  [-i <input-ID>] infile [-o <output-ID>]    [outfile] [OPTIONS]

Note

If only input and output files are given, Open Babel will guess the file type from the filename extension. For information on the file formats supported by Open Babel, please see Supported File Formats and Options.

-a <options> Format-specific input options. Use -H <format-ID> to see options allowed by a particular format, or see the appropriate section in Supported File Formats and Options.
--add <list> Add properties (for SDF, CML, etc.) from descriptors in list. Use -L descriptors to see available descriptors.
--addinindex Append input index to title (that is, the index before any filtering)
--addoutindex Append output index to title (that is, the index after any filtering)
--addtotitle <text>
 Append the text after each molecule title
--append <list>
 Append properties or descriptor values appropriate for a molecule to its title. For more information, see Append property values to the title.
-b Convert dative bonds (e.g. [N+]([O-])=O to N(=O)=O)
-c Center atomic coordinates at (0,0,0)
-C Combine molecules in first file with others having the same name
-d Delete hydrogens (make all hydrogen implicit)
--delete <list>
 Delete properties in list
-e Continue to convert molecules after errors
--errorlevel <N>
 Filter the level of errors and warnings displayed:

  • 1 = critical errors only
  • 2 = include warnings too (default)
  • 3 = include informational messages too
  • 4 = include “audit log” messages of changes to data
  • 5 = include debugging messages too
-f <#> For multiple entry input, start import with molecule # as the first entry
--filter <criteria>
 Filter based on molecular properties. See Filtering molecules from a multimolecule file for examples and a list of criteria.
--gen2d Generate 2D coordinates
--gen3d Generate 3D coordinates
-h Add hydrogens (make all hydrogen explicit)
-i <format-ID> Specifies input format. See Supported File Formats and Options.
-j, --join Join all input molecules into a single output molecule entry
-k Translate computational chemistry modeling keywords. See Computational chemistry formats, for example GAMESS Input (inp, gamin) and Gaussian 98/03 Input (gjf, gjc, gau, com).
-m

Produce multiple output files, to allow:

  • Splitting one input file - put each molecule into consecutively numbered output files
  • Batch conversion - convert each of multiple input files into a specified output format
-l <#> For multiple entry input, stop import with molecule # as the last entry
-o <format-ID> Specifies output format. See Supported File Formats and Options.
-p <pH> Add hydrogens appropriate for pH (use transforms in phmodel.txt)
--partialcharge <charge-method>
 Calculate partial charges by the specified method. List available methods using obabel -L charges.
--property <name value>
 Add or replace a property (for example, in an SD file)
-r Remove all but the largest contiguous fragment (strip salts)
--readconformers
 Combine adjacent conformers in multi-molecule input into a single molecule
-s <SMARTS> Convert only molecules matching the SMARTS pattern specified
-s <filename.xxx>
 Convert only molecules with the molecule in the file as a substructure
--separate Separate disconnected fragments into individual molecular records
--sort Output molecules ordered by the value of a descriptor. See Sorting molecules.
--title <title>
 Add or replace molecular title
--unique, --unique <param>
 Do not convert duplicate molecules. See Remove duplicate molecules.
--writeconformers
 Output multiple conformers as separate molecules
-x <options> Format-specific output options. Use -H <format-ID> to see options allowed by a particular format, or see the appropriate section in Supported File Formats and Options.
-v <SMARTS> Convert only molecules NOT matching the SMARTS pattern specified
-z Compress the output with gzip (not on Windows)

Examples

The examples below assume the files are in the current directory. Otherwise you may need to include the full path to the files, e.g. /Users/username/Desktop/mymols.sdf, and you may need to put quotes around the filenames (especially on Windows, where they can contain spaces).

Standard conversion:

obabel ethanol.xyz -O ethanol.pdb
babel ethanol.xyz ethanol.pdb

Conversion if the files do not have an extension that describes their format:

obabel -ixyz ethanol.aa -opdb -O ethanol.bb
babel -ixyz ethanol.aa -opdb ethanol.bb
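
If you call obabel from a script, passing the arguments as a list avoids most of the quoting problems mentioned above. The following is only a sketch: build_obabel_args is a hypothetical helper, not part of Open Babel, and it covers just the -i, -o and -O options described in this section.

```python
import shlex

def build_obabel_args(infile, outfile, in_format=None, out_format=None):
    """Return an argument list for an obabel conversion (hypothetical helper)."""
    args = ["obabel"]
    if in_format:
        args.append("-i" + in_format)   # e.g. -ixyz
    args.append(infile)
    if out_format:
        args.append("-o" + out_format)  # e.g. -opdb
    args += ["-O", outfile]
    return args

args = build_obabel_args("my mols.sdf", "out.pdb")
# shlex.join shows how the list would be quoted on a POSIX shell.
print(shlex.join(args))  # obabel 'my mols.sdf' -O out.pdb
```

The list can then be handed to subprocess.run(args, check=True) on a machine where obabel is installed.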

Molecules from multiple input files (which can have different formats) are normally combined in the output file:

obabel ethanol.xyz acetal.sdf benzene.cml -O allmols.smi

Conversion from a SMI file in STDIN to a Mol2 file written to STDOUT:

obabel -ismi -omol2

Split a multi-molecule file into new1.smi, new2.smi, etc.:

obabel infile.mol -O new.smi -m

In Windows this can also be written:

obabel infile.mol -O new*.smi

Multiple input files can be converted in batch format too. To convert all files ending in .xyz (*.xyz) to PDB files, you can type:

obabel *.xyz -opdb -m
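
The two uses of -m produce different output names. The sketch below only illustrates the naming; split_names follows the new1.smi, new2.smi pattern given above, while batch_names assumes that each output keeps the base name of its input and takes the new extension. Both function names are hypothetical.

```python
from pathlib import Path

def split_names(outfile, count):
    """Names used when one input file is split into `count` numbered files."""
    p = Path(outfile)
    return [f"{p.stem}{i}{p.suffix}" for i in range(1, count + 1)]

def batch_names(infiles, out_ext):
    """Assumed names for batch conversion: same base name, new extension."""
    return [str(Path(f).with_suffix("." + out_ext)) for f in infiles]

print(split_names("new.smi", 3))
print(batch_names(["ethanol.xyz", "benzene.xyz"], "pdb"))
```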

Open Babel will not generate coordinates unless asked, so while a conversion from SMILES to SDF will generate a valid SDF file, the resulting file will not contain coordinates. To generate coordinates, use either the --gen3d or the --gen2d option:

obabel infile.smi -O out.sdf --gen3d

If you want to remove all hydrogens (i.e. make them all implicit) when doing the conversion the command would be:

obabel mymols.sdf -osmi -O outputfile.smi -d

If you want to add hydrogens (i.e. make them all explicit) when doing the conversion the command would be:

obabel mymols.sdf -O outputfile.smi -h

If you want to add hydrogens appropriate for pH 7.4 when doing the conversion the command would be:

obabel mymols.sdf -O outputfile.smi -p 7.4

The protonation is done on an atom-by-atom basis so molecules with multiple ionizable centers will have all centers ionized.

Of course you don’t actually need to change the file type to modify the hydrogens. If you want to add all hydrogens the command would be:

obabel mymols.sdf -O mymols_H.sdf -h

Some functional groups, e.g. nitro or sulphone, can be represented in two ways; for nitro these are [N+]([O-])=O and N(=O)=O. To convert all such groups to the second form, use -b:

obabel mymols.sdf -O outputfile.smi -b

If you only want to convert a subset of molecules you can define them using -f and -l. To convert molecules 2-4 of the file mymols.sdf type:

obabel mymols.sdf -f 2 -l 4 -osdf -O outputfile.sdf
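
The -f and -l values count molecules from 1 and include both ends of the range. A minimal sketch of that selection, with a plain Python list standing in for the molecules in the file (select_range is a hypothetical name):

```python
def select_range(molecules, first=1, last=None):
    """1-based, inclusive selection in the style of -f and -l."""
    end = len(molecules) if last is None else last
    return molecules[first - 1:end]

mols = ["mol1", "mol2", "mol3", "mol4", "mol5"]
print(select_range(mols, first=2, last=4))  # ['mol2', 'mol3', 'mol4']
```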

Alternatively you can select a subset matching a SMARTS pattern, so to select all molecules containing bromobenzene use:

obabel mymols.sdf -O selected.sdf -s "c1ccccc1Br"

You can also select the subset that do not match a SMARTS pattern, so to select all molecules not containing bromobenzene use:

obabel mymols.sdf -O selected.sdf -v "c1ccccc1Br"

You can of course combine options, so to join molecules and add hydrogens type:

obabel mymols.sdf -O myjoined.sdf -h -j

Files compressed with gzip are read transparently, whether or not they have a .gz suffix:

obabel compressed.sdf.gz -O expanded.smi

On platforms other than Windows, the output file can be compressed with gzip, but note if you don’t specify the .gz suffix it will not be added automatically, which could cause problems when you try to open the file:

obabel mymols.sdf -O outputfile.sdf.gz -z

This next example reads the first 50 molecules in a compressed dataset and prints out the SMILES of those containing a pyridine ring, together with the index in the file, the ID (taken from an SDF property) as well as the output index:

obabel chembl_02.sdf.gz -osmi -l 50 -s c1ccccn1 --append chebi_id --addinindex --addoutindex

For the test data (taken from ChEMBLdb), this gave:

N1(CCN(CC1)c1c(cc2c3c1OCC(n3cc(c2=O)C(=O)O)C)F)C        3 100146 1
c1(c(=O)c2c(n(c1)OC)c(c(N1CC(CC1)CNCC)c(c2)F)F)C(=O)O   6 100195 2
S(=O)(=O)(Nc1ncc(cc1)C)c1c2c(c(N(C)C)ccc2)ccc1          22 100589 3
c1([nH]c2c(c1)cccc2)C(=O)N1CCN(c2c(N(CC)CC)cccn2)CC1    46 101536 4
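
The difference between the two indices can be sketched as follows. This is an illustration only: number_records is a hypothetical function, and the test for a lowercase "n" is a crude stand-in for a real SMARTS match.

```python
def number_records(records, keep, last=None):
    """Yield (record, input_index, output_index) for records that pass `keep`."""
    out_index = 0
    for in_index, rec in enumerate(records, start=1):
        if last is not None and in_index > last:
            break                      # like -l: stop after this many inputs
        if keep(rec):
            out_index += 1             # counts only the molecules written out
            yield rec, in_index, out_index

smiles = ["CCO", "c1ccccn1", "CC", "Cc1ccccn1"]
for row in number_records(smiles, keep=lambda s: "n" in s, last=50):
    print(row)
```

The input index records the position in the original file, so it has gaps wherever a molecule was filtered out; the output index never does.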

Differences between babel and obabel

Essentially obabel is a modern version of babel with additional capabilities and a more standard interface. Over time, obabel will replace babel and so we recommend that you start using obabel now.

Specifically, the differences are as follows:

  • obabel requires that the output file be specified with a -O option. This is closer to the normal Unix convention for commandline programs, and prevents users accidentally overwriting the input file.

  • obabel is more flexible when the user needs to specify parameter values on options. For instance, the --unique option can be used with or without a parameter (specifying the criteria used). With babel, this only works when the option is the last on the line; with obabel, no such restriction applies. Because of the original design of babel, it is not possible to add this capability in a backwards-compatible way.

  • obabel has a shortcut for entering SMILES strings. Precede the SMILES by -: and use in place of an input file. The SMILES string should be enclosed in quotation marks. For example:

    obabel -:"O=C(O)c1ccccc1OC(=O)C" -ocan

    More than one can be used, and a molecule title can be included if enclosed in quotes:

    obabel -:"O=C(O)c1ccccc1OC(=O)C aspirin" -:"Oc1ccccc1C(=O)O salicylic acid" -ofpt

  • obabel cannot use concatenated single-character options.

Tip

To adapt a command line for babel into one for obabel you can usually simply put -O in front of the output filename.

Format Options

Individual file formats may have additional formatting options. These are listed in the documentation for the individual formats (see Supported File Formats and Options) or can be shown using the -H <format-ID> option, e.g. -H cml.

To use these additional options, input format options are preceded by -a, e.g. -as. Output format options, which are much more common, are preceded by -x, e.g. -xn. So to read the 2D coordinates (rather than the 3D) from a CML file and generate an SVG file displaying the molecule on a black background, the relevant options are used as follows:

babel mymol.cml out.svg -a2 -xb

Append property values to the title

The command line option --append adds extra information to the title of the molecule.

The information can be calculated from the structure of the molecule or can originate from a property attached to the molecule (in the case of CML and SDF input files). It is used as follows:

babel infile.sdf -osmi --append "MW CAT_NO"

MW is the ID of a descriptor which calculates the molecular weight of the molecule, and CAT_NO is a property of the molecule from the SDF input file. The values of these are added to the title of the molecule. For input files with many molecules these additions are specific to each molecule. (Note that the related option --addtotitle simply adds the same text to every title.)

The append option only takes one parameter, which means that all of the descriptor IDs or property names must be enclosed together in a single set of quotes.

If the name of the property in the SDF file (internally the Attribute in OBPairData) contains spaces, these spaces should be replaced by underscore characters, ‘_’. So the example above would also work for a property named CAT NO.

By default, the extra items are added to the title separated by spaces. But if the first character in the parameter is a whitespace or punctuation character other than ‘_’, it is used as the separator instead. Note that in the GUI, because Tab is used to move between controls, if a Tab character is required it has to be pasted in.
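
The separator rule can be sketched in a few lines. This models the description above and is not Open Babel's own code; both function names are hypothetical, and placing the separator before every appended item is an assumption.

```python
import string

def pick_separator(param):
    """Return (separator, remainder) following the rule described above."""
    first = param[:1]
    if first and first != "_" and (first.isspace() or first in string.punctuation):
        return first, param[1:]
    return " ", param                  # default separator is a space

def extend_title(title, values, sep):
    """Append each value to the title, each preceded by the separator (assumed)."""
    return title + "".join(sep + str(v) for v in values)

sep, names = pick_separator("\tMW CAT_NO")
print(repr(sep), names)
print(extend_title("ethanol", [46.07, "A123"], sep))
```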

Filtering molecules from a multimolecule file

Six of the options above can be used to filter molecules:

  • -s - convert molecules that match a SMARTS string
  • -v - convert molecules that don’t match a SMARTS string
  • -f and -l - convert molecules in a certain range
  • --unique - only convert unique molecules (that is, remove duplicates)
  • --filter - convert molecules that meet specified chemical (and other) criteria

This section focuses on the --filter option, which is very versatile and can select a subset of molecules based either on properties imported with the molecule (as from an SDF file) or from calculations made by Open Babel on the molecule.

The aim has been to make the option flexible and intuitive to use; don’t be put off by the long description.

You use it like this:

babel filterset.sdf -osmi --filter "MW<130 ROTATABLE_BOND > 2"

It takes one parameter, which usually needs to be enclosed in double quotes to avoid confusing the shell or operating system. (You don’t need the quotes with the Windows GUI.) The parameter contains one or more conditional tests. By default, all of them must be true for the molecule to be converted. As well as this implicit AND behaviour, you can write a full Boolean expression (see below). As the example shows, spaces around the operators are optional, and the conditional tests can also be separated by a comma or semicolon.
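
The implicit AND can be pictured with a small sketch. It is not the parser Open Babel uses: passes is a hypothetical function, it handles only numeric tests, and the set of operators it accepts is chosen for illustration.

```python
import operator
import re

OPS = {"<": operator.lt, ">": operator.gt, "<=": operator.le,
       ">=": operator.ge, "=": operator.eq, "!=": operator.ne}

# name, operator, number; anything between tests (space, comma, semicolon) is skipped
TEST = re.compile(r"(\w+)\s*(<=|>=|!=|<|>|=)\s*([-+]?\d+(?:\.\d+)?)")

def passes(values, criteria):
    """True when every numeric test in `criteria` holds for `values` (a dict)."""
    tests = TEST.findall(criteria)
    return all(OPS[op](values[name], float(num)) for name, op, num in tests)

mol = {"MW": 100.0, "ROTATABLE_BOND": 3}
print(passes(mol, "MW<130 ROTATABLE_BOND > 2"))   # True
print(passes(mol, "MW<130; ROTATABLE_BOND > 5"))  # False
```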

You can filter on two types of property:

  • An SDF property, as the identifier ROTATABLE_BOND could be. There is no need for it to be previously known to Open Babel.

  • A descriptor name (internally, an ID of an OBDescriptor object). This is a plug-in class so that new objects can easily be added. MW is the ID of a descriptor which calculates molecular weight. You can see a list of available descriptors using:

    babel -L descriptors

    or from a menu item in the GUI.

The descriptor names are case-insensitive. Property names, however, are currently case-sensitive. Both types of identifier can contain letters, numbers and underscores, ‘_’. Properties can contain spaces, but when writing the name in the filter parameter you need to replace them with underscores. So in the example above, the test would also be suitable for a property ‘ROTATABLE BOND’.

Open Babel uses an SDF-like property (internally this is stored in the class OBPairData) in preference to a descriptor if one exists in the molecule. So with the example file filterset.sdf:

babel filterset.sdf -osmi --filter "logP>5"

converts only a molecule with a property logP=10.900, since the others do not have this property and logP, being also a descriptor, is calculated and is always much less than 5.
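
The lookup order can be sketched as below. The function lookup is hypothetical, the dictionaries stand in for a molecule's stored properties and for the descriptor plugins, and the calculated value 1.2 is invented for the example.

```python
def lookup(name, properties, descriptors):
    """Resolve a filter identifier: stored property first, then descriptor."""
    for key in (name, name.replace("_", " ")):
        if key in properties:            # property names are case-sensitive
            return float(properties[key])
    return descriptors[name.lower()]()   # descriptor IDs are case-insensitive

descriptors = {"logp": lambda: 1.2}      # made-up calculated value
print(lookup("logP", {"logP": "10.900"}, descriptors))  # 10.9
print(lookup("logP", {}, descriptors))                  # 1.2
```

Only the first molecule would pass a test such as logP>5, because its stored property hides the calculated value.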

If a property does not have a conditional test, then it returns true only if it exists. So:

babel filterset.sdf -osmi --filter "ROTATABLE_BOND MW<130"

converts only those molecules with a ROTATABLE_BOND property and a molecular weight less than 130. If you wanted to also include all the molecules without ROTATABLE_BOND defined, use:

babel filterset.sdf -osmi --filter "!ROTATABLE_BOND || (ROTATABLE_BOND & MW<130)"

The ! means negate. AND can be & or &&, OR can be | or ||. The brackets are not strictly necessary here because & has precedence over | in the normal way. If the result of a test doesn’t matter, it is parsed but not evaluated. In the example, the expression in the brackets is not evaluated for molecules without a ROTATABLE_BOND property. This doesn’t matter here, but if evaluation of a descriptor involved a lot of computation, it would pay to include it late in the boolean expression so that there is a chance it is skipped for some molecules.
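
Python's own or and and operators short-circuit in the same way, so the effect is easy to demonstrate. In this sketch expensive_mw is a hypothetical stand-in for a costly descriptor, and the calls list records which molecules it was actually evaluated for.

```python
calls = []

def expensive_mw(mol):
    calls.append(mol["title"])           # record each evaluation
    return mol["MW"]

def keep(mol):
    has_rb = "ROTATABLE_BOND" in mol["props"]
    # Mirrors "!ROTATABLE_BOND || (ROTATABLE_BOND & MW<130)"
    return (not has_rb) or (has_rb and expensive_mw(mol) < 130)

mols = [
    {"title": "a", "MW": 200.0, "props": {}},
    {"title": "b", "MW": 100.0, "props": {"ROTATABLE_BOND": "3"}},
    {"title": "c", "MW": 150.0, "props": {"ROTATABLE_BOND": "1"}},
]
results = [keep(m) for m in mols]
print(results)  # [True, True, False]
print(calls)    # ['b', 'c']
```

Molecule a is accepted without its molecular weight ever being computed.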

Descriptors must have a conditional test and it is an error if they don’t. The default test, as used by MW or logP, is a numerical one, but the parsing of the text, and what the test does is defined in each descriptor’s code (a virtual function in the OBDescriptor class). Three examples of this are described in the following sections.

String descriptors

babel filterset.sdf -osmi --filter "title='Ethanol'"

The descriptor title, when followed by a string (here enclosed by single quotes), does a case-sensitive string comparison. (‘ethanol’ wouldn’t match anything in the example file.) The comparison does not have to be just equality:

babel filterset.sdf -osmi --filter "title>='D'"

converts molecules with titles Dimethyl Ether and Ethanol in the example file.

It is not always necessary to use the single quotes when the meaning is unambiguous: the two examples above work without them. But a numerical, rather than a string, comparison is made if both operands can be converted to numbers. This can be useful:

babel filterset.sdf -osmi --filter "title<129"

will convert the molecules with titles 56, 123 and 126, which is probably what you wanted.

babel filterset.sdf -osmi --filter "title<'129'"

converts only 123 and 126 because a string comparison is being made.
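
The two results can be reproduced with a short sketch of the comparison rule. The names less_than and as_number are hypothetical, and the quoted flag stands for writing the operand in single quotes.

```python
def as_number(text):
    """Return text as a float, or None if it is not a number."""
    try:
        return float(text)
    except ValueError:
        return None

def less_than(title, operand, quoted=False):
    """Compare numerically when allowed and possible, otherwise as strings."""
    a, b = as_number(title), as_number(operand)
    if not quoted and a is not None and b is not None:
        return a < b
    return title < operand

titles = ["56", "123", "126", "Ethanol"]
print([t for t in titles if less_than(t, "129")])               # ['56', '123', '126']
print([t for t in titles if less_than(t, "129", quoted=True)])  # ['123', '126']
```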

String comparisons can use * as a wildcard. It can only be used as the first or last character of the string. So --filter "title='*ol'" will match molecules with titles ‘methanol’, ‘ethanol’ etc. and --filter "title='eth*'" will match ‘ethanol’, ‘ethyl acetate’, ‘ethical solution’ etc.
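
A leading or trailing wildcard amounts to a suffix or prefix test. The sketch below assumes exactly that behaviour and that the comparison stays case-sensitive; title_matches is a hypothetical name.

```python
def title_matches(title, pattern):
    """Case-sensitive match with an optional leading or trailing '*'."""
    if len(pattern) > 1 and pattern.startswith("*"):
        return title.endswith(pattern[1:])     # '*ol' matches a suffix
    if len(pattern) > 1 and pattern.endswith("*"):
        return title.startswith(pattern[:-1])  # 'eth*' matches a prefix
    return title == pattern

print(title_matches("methanol", "*ol"))        # True
print(title_matches("ethyl acetate", "eth*"))  # True
print(title_matches("Ethanol", "eth*"))        # False
```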

SMARTS descriptor

This descriptor will do a SMARTS test (substructure and more) on the molecules. The smarts ID can be abbreviated to s and the = is optional. More than one SMARTS test can be done:

babel filterset.sdf -osmi --filter "s='CN' s!='[N+]'"

This provides a more flexible alternative to the existing -s and -v options, since the SMARTS descriptor test can be combined with other tests.

InChI descriptor

babel filterset.sdf -osmi --filter "inchi='InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3'"

will convert only ethanol. It uses the default parameters for InChI comparison, so there may be some messages from the InChI code. There is quite a lot of flexibility on how the InChI is presented (you can miss out the non-essential bits):

babel filterset.sdf -osmi --filter "inchi='1/C2H6O/c1-2-3/h3H,2H2,1H3'"
babel filterset.sdf -osmi --filter "inchi='C2H6O/c1-2-3/h3H,2H2,1H3'"
babel filterset.sdf -osmi --filter "inchi=C2H6O/c1-2-3/h3H,2H2,1H3"
babel filterset.sdf -osmi --filter "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"

all have the same effect.

The comparison of the InChI string is done only as far as the parameter’s length. This means that we can take advantage of InChI’s layered structure:

babel filterset.sdf -osmi --filter "inchi=C2H6O"

will convert both Ethanol and Dimethyl Ether.
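
Comparing only as far as the parameter's length is a prefix test once the optional parts have been removed. The following sketch assumes that only the InChI= prefix and the 1/ version are optional; the function names are hypothetical and the real InChI code does considerably more.

```python
def normalise(inchi):
    """Drop the optional 'InChI=' and '1/' prefixes (illustrative only)."""
    if inchi.startswith("InChI="):
        inchi = inchi[len("InChI="):]
    if inchi.startswith("1/"):
        inchi = inchi[2:]
    return inchi

def inchi_matches(mol_inchi, param):
    """Compare only as far as the length of the parameter."""
    return normalise(mol_inchi).startswith(normalise(param))

ethanol = "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"
dimethyl_ether = "InChI=1/C2H6O/c1-3-2/h1-2H3"
print(inchi_matches(ethanol, "C2H6O"), inchi_matches(dimethyl_ether, "C2H6O"))
print(inchi_matches(dimethyl_ether, "1/C2H6O/c1-2-3/h3H,2H2,1H3"))
```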

Substructure and similarity searching

For information on using babel for substructure searching and similarity searching, see Molecular fingerprints and similarity searching.

Sorting molecules

The --sort option is used to output molecules ordered by the value of a descriptor:

babel  infile.xxx  outfile.xxx  --sort desc

If the descriptor desc provides a numerical value, the molecule with the smallest value is output first. For descriptors that provide a string output the order is alphabetical, but for the InChI descriptor a more chemically informed order is used (e.g. “CH4” comes before “C2H6”, and “CH4” comes before “ClH”, hydrogen chloride).

The order can be reversed by preceding the descriptor name with ~, e.g.:

babel  infile.xxx  outfile.yyy  --sort ~logP

As a shortcut, the value of the descriptor can be appended to the molecule name by adding a + to the descriptor, e.g.:

babel  aromatics.smi  -osmi  --sort ~MW+
 c1ccccc1C=C   styrene 104.149
 c1ccccc1C     toluene 92.1384
 c1ccccc1      benzene 78.1118
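
The ~ and + modifiers can be modelled as follows. The function sort_molecules is hypothetical and works on plain dictionaries rather than molecules; the values are the ones from the output above.

```python
def sort_molecules(mols, spec):
    """Order dicts by the descriptor named in `spec` ('~' reverses, '+' appends)."""
    reverse = spec.startswith("~")
    append = spec.endswith("+")
    name = spec.strip("~+")
    ordered = sorted(mols, key=lambda m: m[name], reverse=reverse)
    return [f"{m['title']} {m[name]}" if append else m["title"] for m in ordered]

mols = [
    {"title": "benzene", "MW": 78.1118},
    {"title": "styrene", "MW": 104.149},
    {"title": "toluene", "MW": 92.1384},
]
print(sort_molecules(mols, "~MW+"))
print(sort_molecules(mols, "MW"))
```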

Remove duplicate molecules

The --unique option is used to remove, i.e. not output, any chemically identical molecules during conversion:

babel  infile.xxx  outfile.yyy  --unique [param]

The optional parameter param defines what is regarded as “chemically identical”. It can be the name of any descriptor, although not many are likely to be useful. If param is omitted, the InChI descriptor is used. Other useful descriptors are ‘cansmi’ and ‘cansmiNS’ (canonical SMILES, with and without stereochemical information), ‘title’ and truncated InChI (see below).

Note that if you want to use --unique without a parameter with babel, it needs to be last on the line. With the alternative commandline interface, obabel, it can be anywhere after the output file.

A message is output for each duplicate found:

Removed methyl benzene - a duplicate of toluene (#1)

Clearly, this is more useful if each molecule has a title. The (#1) is the number of duplicates found so far.
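
The bookkeeping behind these messages can be sketched like this. The function unique is hypothetical, and the key function is a placeholder for whatever descriptor param selects; here the structures are assumed to be canonical strings already.

```python
def unique(mols, key):
    """Return (kept titles, messages) for (title, structure) pairs."""
    seen = {}                      # key value -> title of first occurrence
    kept, messages = [], []
    for title, structure in mols:
        k = key(structure)
        if k in seen:
            messages.append(
                f"Removed {title} - a duplicate of {seen[k]} (#{len(messages) + 1})")
        else:
            seen[k] = title
            kept.append(title)
    return kept, messages

mols = [("toluene", "Cc1ccccc1"), ("benzene", "c1ccccc1"),
        ("methyl benzene", "Cc1ccccc1")]
kept, messages = unique(mols, key=lambda s: s)
print(kept)
print(messages)
```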

If you wanted to identify duplicates but not output the unique molecules, you could use the null format:

babel  infile.xxx  -onul  --unique

Truncated InChI

It is possible to relax the criterion by which molecules are regarded as “chemically identical” by using a truncated InChI specification as param. This takes advantage of the layered structure of InChI. So to remove duplicates, treating stereoisomers as the same molecule:

babel  infile.xxx  outfile.yyy  --unique /nostereo

Truncated InChI specifications start with / and are case-sensitive. param can be a concatenation of these e.g. /nochg/noiso:

/formula   formula only
/connect   formula and connectivity only
/nostereo  ignore E/Z and sp3 stereochemistry
/nosp3     ignore sp3 stereochemistry
/noEZ      ignore E/Z stereochemistry
/nochg     ignore charge and protonation
/noiso     ignore isotopes

Multiple files

The input molecules do not have to be in a single file. So to collect all the unique molecules from a set of MOL files:

babel  *.mol  uniquemols.sdf  --unique

If you want the unique molecules to remain in individual files:

babel  *.mol  U.mol  -m  --unique

On the GUI use the form:

babel  *.mol  U*.mol  --unique

Either form is acceptable on the Windows command line.

The unique molecules will be in files with the original name prefixed by ‘U’. Duplicate molecules will be in similar files but with zero length, which you will have to delete yourself.

Aliases for chemical groups

There is a limited amount of support for representing common chemical groups by an alias, e.g. benzoic acid as Ph-COOH, with two alias groups. Internally in Open Babel, the molecule usually has a ‘real’ structure with the alias names present as only an alternative representation. For MDL MOL and SD files alias names can be read from or written to an ‘A’ line. The more modern RGroup representations are not yet recognized. Reading is transparent; the alias group is expanded and the ‘real’ atoms given reasonable coordinates if the molecule is 2D or 3D. Writing in alias form, rather than the ‘real’ structure, requires the use of the -xA option. SVGFormat will also display any aliases present in a molecule if the -xA option is set.

The alias names that are recognized are in the file superatoms.txt which can be edited.

Normal molecules can have certain common groups given alternative alias representation using the --genalias option. The groups that are recognized and converted are a subset of those that are read. Displaying or writing them still requires the -xA option. For example, if aspirin.smi contained O=C(O)c1ccccc1OC(=O)C, it could be displayed with the aliases COOH and OAc by:

obabel aspirin.smi  -O out.svg  --genalias  -xA