SMILES (Simplified Molecular Input Line Entry System) is a set of rules (a specification) for unambiguously describing the composition and structure of a chemical molecule as a string of ASCII characters. The name is a homonym of the English word "smiles" and is pronounced the same way, but it is always written in capital letters. Russian has no established equivalent, so using the original English name is recommended.
A string written according to the SMILES rules can be converted by many molecular editors into a two-dimensional or three-dimensional structural formula of the molecule.
The initial version of the SMILES specification was developed by Arthur Weininger and David Weininger in the late 1980s [1]. The standard was subsequently modified and extended, with Daylight Chemical Information Systems, Inc. taking the most active part in this work.
Other linear notations include Wiswesser line notation (WLN), SMARTS, ROSDAL and Sybyl Line Notation (Tripos Inc.). IUPAC has proposed InChI as a standard for the linear representation of formulas. Compared with InChI, SMILES has the advantages of being easier for a person to read and of simpler software support, since it rests on an extensive theoretical foundation, graph theory.
Variants of the SMILES Specification
The original SMILES specification contains no rules for writing a molecule in a unique way or for distinguishing between spatial isomers. Extensions to the standard were developed to address these gaps:
- “Canonical SMILES” [2] is a version of the specification that includes canonicalization rules, which allow the formula of any molecule to be written in a unique way. The rules govern the choice of the first atom in the string, the direction in which rings are traversed, and the choice of the main chain at branch points. Because different molecular modeling packages use different canonicalization algorithms, which may produce different strings for the same molecule, the notion of “canonical SMILES” is not absolute. This version of the standard is typically used to index molecules in databases and to check them for uniqueness.
- “Isomeric SMILES” [3] is a version of the specification that allows the string to carry information on isotopic composition and on the configuration of asymmetric carbon atoms and double bonds. Unlike the official IUPAC nomenclature, isomeric SMILES can describe molecules for which the configuration of only some chiral centers or double bonds is known.
Definition in terms of graph theory
In terms of graph theory, a SMILES string is obtained by printing the symbols of the vertices of a molecular graph in the order of a depth-first traversal. The graph is first preprocessed by removing hydrogen atoms and breaking cycles so that the result is a spanning forest. The places where cycles were broken are marked with digits, which indicate that a bond exists there in the original molecule. Parentheses are used to mark the branch points of the molecule.
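The traversal described above can be illustrated with a minimal sketch. It assumes a connected, hydrogen-suppressed graph with single bonds only and fewer than ten rings; the function name write_smiles and its input format are hypothetical choices for this illustration, not part of any SMILES toolkit, and the output is not canonical.

```python
def write_smiles(symbols, bonds, start=0):
    """Write a SMILES-like string for a connected graph with single bonds only.

    symbols: list of element symbols, one per atom (hydrogens already removed)
    bonds:   list of (i, j) index pairs
    """
    neighbors = {i: [] for i in range(len(symbols))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    visited = set()
    children = {i: [] for i in neighbors}   # spanning-tree edges found by the DFS
    closures = {i: [] for i in neighbors}   # ring-closure digits attached to each atom
    broken = set()                          # bonds removed to break cycles

    def explore(atom, parent):
        visited.add(atom)
        for nbr in neighbors[atom]:
            if nbr == parent:
                continue
            if nbr not in visited:
                children[atom].append(nbr)
                explore(nbr, atom)
            elif frozenset((atom, nbr)) not in broken:
                # Bond to an already visited atom: break it and mark both ends.
                broken.add(frozenset((atom, nbr)))
                digit = len(broken)
                closures[atom].append(digit)
                closures[nbr].append(digit)

    def emit(atom):
        text = symbols[atom] + "".join(str(d) for d in closures[atom])
        for child in children[atom][:-1]:
            text += "(" + emit(child) + ")"      # side branches go in parentheses
        if children[atom]:
            text += emit(children[atom][-1])     # the last child continues the chain
        return text

    explore(start, None)
    return emit(start)


# Isobutane, then cyclohexane (a six-membered ring).
print(write_smiles(["C", "C", "C", "C"], [(0, 1), (1, 2), (1, 3)]))        # CC(C)C
print(write_smiles(["C"] * 6, [(i, (i + 1) % 6) for i in range(6)]))       # C1CCCCC1
```

The result depends on the start atom and on the order of the neighbor lists, which is why a separate canonicalization step is needed to obtain a unique string.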
SMILES Building Principles
Atoms
Atoms are denoted by the symbols of the chemical elements in square brackets; for example, gold is written as [Au]. For the elements of the organic subset (B, C, N, O, P, S, F, Cl, Br, I) the brackets may be omitted. In that case hydrogen atoms need not be written explicitly, provided their number corresponds to the lowest normal valence consistent with the explicitly given bonds. Atoms in aromatic rings are usually written in lowercase rather than uppercase letters, although some SMILES dialects instead use explicitly alternating double and single bonds (as in the structural formula of benzene proposed by Kekulé). When a formal charge has to be specified, the atom is written in brackets with its hydrogen atoms and the charge given explicitly [3]. Isotopes are written in square brackets with the mass number in front of the atom symbol; for example, the carbon-13 isotope is written as [13C].
For example, the SMILES string for water is O and for ethanol CCO. The hydroxide anion is written as [OH-] and the iron(II) ion as [Fe+2].
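As an illustration of the bracket notation, the following sketch splits a bracketed atom into isotope, symbol, hydrogen count and charge. The regular expression and the function name parse_bracket_atom are assumptions made for this example; the pattern deliberately covers only the fields mentioned above and ignores chirality marks and other bracket features.

```python
import re

# Simplified pattern: [isotope? symbol H-count? charge?]
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?(?P<symbol>[A-Z][a-z]?|[a-z])"
    r"(?P<hcount>H\d*)?(?P<charge>\+\d+|-\d+|\++|-+)?\]"
)


def parse_bracket_atom(token):
    """Return the fields of a bracket atom such as [OH-] as a dictionary."""
    match = BRACKET_ATOM.fullmatch(token)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {token!r}")

    hcount = match["hcount"]
    hydrogens = 0 if hcount is None else int(hcount[1:] or 1)   # "H" alone means one

    charge_text = match["charge"]
    if charge_text is None:
        charge = 0
    else:
        sign = 1 if charge_text[0] == "+" else -1
        digits = charge_text.lstrip("+-")
        # "+2" carries a number; "++" is counted by its length.
        charge = sign * (int(digits) if digits else len(charge_text))

    return {
        "isotope": int(match["isotope"]) if match["isotope"] else None,
        "symbol": match["symbol"],
        "hydrogens": hydrogens,
        "charge": charge,
    }


for token in ["[Au]", "[OH-]", "[Fe+2]", "[13C]"]:
    print(token, parse_bracket_atom(token))
```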
Bonds
A single chemical bond can be written with the symbol - between the bonded atoms, but in practice the hyphen is omitted. The aromatic bond symbol (:) is usually omitted as well. A double bond is denoted by an equals sign; for example, carbon dioxide is written as O=C=O. A triple bond is denoted by a hash sign (octothorpe, #); for example, hydrogen cyanide is written as C#N.
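The interplay between bond symbols and omitted hydrogen atoms can be sketched as follows. The example handles only unbranched, acyclic strings made of organic-subset atoms, and the valence table is the commonly cited set of normal valences, stated here as an assumption; the function name implicit_hydrogens is hypothetical.

```python
import re

BOND_ORDER = {"-": 1, "=": 2, "#": 3}
# Normal valences assumed for the organic subset.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#]")


def implicit_hydrogens(smiles):
    """Return (symbol, implicit H count) pairs for a linear SMILES string."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")

    atoms, bond_sums = [], []
    pending = 1                      # an omitted bond symbol means a single bond
    for token in tokens:
        if token in BOND_ORDER:
            pending = BOND_ORDER[token]
            continue
        if atoms:
            bond_sums[-1] += pending     # bond to the previous atom
            bond_sums.append(pending)
        else:
            bond_sums.append(0)
        atoms.append(token)
        pending = 1

    result = []
    for symbol, used in zip(atoms, bond_sums):
        # Lowest normal valence that accommodates the explicit bonds.
        target = next((v for v in NORMAL_VALENCES[symbol] if v >= used), used)
        result.append((symbol, target - used))
    return result


print(implicit_hydrogens("CCO"))     # [('C', 3), ('C', 2), ('O', 1)]
print(implicit_hydrogens("O=C=O"))   # [('O', 0), ('C', 0), ('O', 0)]
print(implicit_hydrogens("C#N"))     # [('C', 1), ('N', 0)]
```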
Branching a molecule
Side chains of the molecule are enclosed in parentheses. For example, propionic acid is written as CCC(=O)O. The canonical form of trifluoromethane is C(F)(F)F, but this is hard to read because of the many parentheses, so the same molecule can also be written in the non-canonical form FC(F)F.
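One common way to read parentheses is with a stack of branch points. The sketch below assumes organic-subset atoms without rings or bracket atoms; the names parse_branches and neighbor_symbols are hypothetical helpers for this illustration.

```python
import re

TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#()]")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_branches(smiles):
    """Return (atoms, bonds); bonds are (from_index, to_index, order) triples."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")

    atoms, bonds, stack = [], [], []
    previous, order = None, 1
    for token in tokens:
        if token == "(":
            stack.append(previous)       # remember the branch point
        elif token == ")":
            previous = stack.pop()       # return to the branch point
        elif token in BOND_ORDER:
            order = BOND_ORDER[token]
        else:
            atoms.append(token)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index, order))
            previous, order = index, 1
    return atoms, bonds


def neighbor_symbols(atoms, bonds, index):
    """Sorted symbols of the atoms bonded to atoms[index]."""
    found = []
    for i, j, _ in bonds:
        if i == index:
            found.append(atoms[j])
        elif j == index:
            found.append(atoms[i])
    return sorted(found)


print(parse_branches("CCC(=O)O"))
# (['C', 'C', 'C', 'O', 'O'], [(0, 1, 1), (1, 2, 1), (2, 3, 2), (2, 4, 1)])
```

Parsing C(F)(F)F and FC(F)F with this sketch gives different atom orderings, but in both cases the carbon atom ends up bonded to three fluorine atoms.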
Cyclic compounds
The two atoms at the ends of a bond that was broken when the spanning forest was built are labeled with the same digit. For example, cyclohexane is written as C1CCCCC1 and benzene as c1ccccc1.
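Reading ring-closure digits amounts to remembering where each digit was first seen and adding a bond when it appears again. The following sketch assumes single-digit labels and unbranched strings of organic-subset or lowercase aromatic atoms; parse_rings is a hypothetical name.

```python
import re

TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|[0-9]")


def parse_rings(smiles):
    """Return (atoms, bonds) for a chain with single-digit ring closures."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")

    atoms, bonds, open_rings = [], [], {}
    previous = None
    for token in tokens:
        if token.isdigit():
            if previous is None:
                raise ValueError("ring digit before the first atom")
            if token in open_rings:
                # Second occurrence: restore the bond that was broken.
                bonds.append((open_rings.pop(token), previous))
            else:
                open_rings[token] = previous
        else:
            atoms.append(token)
            index = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, index))
            previous = index
    if open_rings:
        raise ValueError("unclosed ring digit")
    return atoms, bonds


atoms, bonds = parse_rings("C1CCCCC1")
print(len(atoms), bonds)
# 6 [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (0, 5)]
```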
Stereochemistry
The configuration around a double bond is recorded with the characters / and \. For example, F/C=C/F corresponds to trans-difluoroethylene, while F/C=C\F or F\C=C/F corresponds to cis-difluoroethylene (see figure).
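For the simple pattern used in these examples, one substituent on each side of a C=C bond, the rule reduces to comparing the two direction marks: equal marks give trans, different marks give cis. The sketch below covers only that pattern; double_bond_geometry is a hypothetical name.

```python
import re

# substituent, direction mark, C=C, direction mark, substituent
PATTERN = re.compile(r"(Cl|Br|[BCNOPSFI])([/\\])C=C([/\\])(Cl|Br|[BCNOPSFI])")


def double_bond_geometry(smiles):
    """Return 'cis' or 'trans' for strings of the form X/C=C/Y."""
    match = PATTERN.fullmatch(smiles)
    if match is None:
        raise ValueError(f"unsupported pattern: {smiles!r}")
    # Same mark on both sides means the substituents lie on opposite sides.
    return "trans" if match.group(2) == match.group(3) else "cis"


print(double_bond_geometry("F/C=C/F"))     # trans
print(double_bond_geometry("F/C=C\\F"))    # cis
print(double_bond_geometry("F\\C=C/F"))    # cis
```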
Extensions
SMARTS is a modification of SMILES that allows wildcard specifications for atoms and bonds. It is widely used for searching chemical substance databases. This usage has given rise to a widespread misconception that a computer structure search compares the strings themselves, whereas in fact it performs the far more effective comparison of the graphs built from the SMILES strings.
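The difference between comparing strings and comparing graphs can be shown with a small sketch. It parses two acyclic SMILES strings into labeled graphs and tests them for isomorphism by brute force over atom permutations, which is only practical for very small molecules; real search systems use far more efficient graph algorithms. The names to_graph and same_molecule are hypothetical.

```python
import re
from itertools import permutations

TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[-=#()]")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def to_graph(smiles):
    """Parse an acyclic organic-subset SMILES into (atoms, {bond: order})."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")
    atoms, bonds, stack = [], {}, []
    previous, order = None, 1
    for token in tokens:
        if token == "(":
            stack.append(previous)
        elif token == ")":
            previous = stack.pop()
        elif token in BOND_ORDER:
            order = BOND_ORDER[token]
        else:
            atoms.append(token)
            index = len(atoms) - 1
            if previous is not None:
                bonds[frozenset((previous, index))] = order
            previous, order = index, 1
    return atoms, bonds


def same_molecule(smiles_a, smiles_b):
    """True if the two strings describe isomorphic labeled graphs."""
    atoms_a, bonds_a = to_graph(smiles_a)
    atoms_b, bonds_b = to_graph(smiles_b)
    if sorted(atoms_a) != sorted(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    size = len(atoms_a)
    for perm in permutations(range(size)):
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(size)):
            continue                      # element symbols must match
        mapped = {frozenset(perm[i] for i in pair): order
                  for pair, order in bonds_a.items()}
        if mapped == bonds_b:
            return True
    return False


print("CCO" == "OCC", same_molecule("CCO", "OCC"))   # False True
print(same_molecule("CCO", "COC"))                   # False
```

Ethanol written as CCO or OCC differs as a string but yields the same graph, whereas CCO and COC (dimethyl ether) contain the same atoms and yield different graphs.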
Conversions
A SMILES string can be converted into a two-dimensional structural formula using the structure diagram generation (SDG) algorithms described by Helson [4]. The conversion does not always give an unambiguous result. Conversion into a three-dimensional structural formula is based on the principle of minimum energy.
See also
- Molecular editor
- International Chemical Identifier (InChI)
Notes
- [1] David Weininger. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci., 1988, Vol. 28, No. 1, pp. 31-36.
- [2] David Weininger, Arthur Weininger, Joseph L. Weininger. SMILES. 2. Algorithm for generation of unique SMILES notation. J. Chem. Inf. Comput. Sci., 1989, Vol. 29, No. 2, pp. 97-101.
- [3] SMILES - A Simplified Chemical Language. Daylight Chemical Information Systems, Inc. Description of the SMILES standard on the Daylight website. Retrieved May 4, 2009. Archived February 12, 2012.
- [4] Harold E. Helson. Structure Diagram Generation. Reviews in Computational Chemistry, 1999, Vol. 13, pp. 313-398. Eds. K. B. Lipkowitz, D. B. Boyd. Wiley-VCH.