Written by Huan Tran, University of Connecticut (email@example.com), January 2017.
SMILES, which stands for "simplified molecular-input line-entry system", uses short ASCII strings to represent the structure of chemical species. With some small adjustments, it can also be used to represent the repeating unit of polymers. Below are the rules that apply to the SMILES format of polymer repeating units.
Rules for polymer repeating block SMILES
Because the SMILES format described here is adapted for polymers, it is not identical
to the standard SMILES format. Strictly following the rules explained below is crucial for obtaining correct
results. Some, but not all, of the rules are checked by Polymer Genome, and an error message is printed when a
violation is found. Details of the rules are given below, while the SMILES strings of some example polymer
blocks and polymers are provided in Table 1.
No spaces are permitted in a SMILES string.
An atom is represented by its atomic symbol. A 2-character atomic symbol
is placed between a pair of square brackets [ ].
Single bonds are implied by placing atoms next to each other. A double bond is represented
by the = symbol while a triple bond is represented by #.
Hydrogen atoms are suppressed, i.e., the polymer blocks are represented without hydrogen.
The Polymer Genome interface assumes the typical valence of each atom type (see Table 2).
If the bonds specified in the SMILES string do not satisfy the valence of an atom, the remaining
dangling bonds are automatically saturated with hydrogen atoms.
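The saturation rule amounts to simple arithmetic: the number of added hydrogens is the assumed valence minus the sum of the bond orders already specified. The sketch below illustrates this for a single atom. The function name `implicit_hydrogens` and the valence values in `ASSUMED_VALENCE` are assumptions made for illustration; they are not taken from Table 2 or from the Polymer Genome code.

```python
# Hypothetical valence table: common textbook values, assumed here for
# illustration only; they are not copied from Table 2.
ASSUMED_VALENCE = {"C": 4, "N": 3, "O": 2}


def implicit_hydrogens(symbol, bond_orders):
    """Return how many hydrogens would saturate one atom.

    symbol      -- atomic symbol, e.g. "C"
    bond_orders -- orders of all bonds the atom already has
                   (1 = single, 2 = double, 3 = triple)
    """
    valence = ASSUMED_VALENCE[symbol]
    used = sum(bond_orders)
    if used > valence:
        raise ValueError(f"{symbol} has {used} bonds, more than valence {valence}")
    return valence - used


# A carbon atom with two single bonds is left with two dangling bonds.
print(implicit_hydrogens("C", [1, 1]))
```

With these assumed values, a carbon atom holding one double and one single bond would receive one hydrogen, and an oxygen atom holding two single bonds would receive none.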
Aromatic atoms are not distinguished from other atoms. Indicate the nature of
each bond explicitly instead. For example, the benzene molecule C6H6 should not be written as c1ccccc1,
as in standard SMILES, but as C1=CC=CC=C1 for Polymer Genome. Consequently, atomic symbols must always
start with a capital letter, never a lowercase one.
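The character-level rules (no spaces, bracketed 2-character symbols, capitalized atomic symbols, = and # for bonds) can be checked before a string is submitted. The following is a minimal sketch, not the checker used by Polymer Genome: the name `find_problems` and the `TOKEN` pattern are assumptions, and the pattern also accepts the round brackets and ring numbers covered by the other rules in this section.

```python
import re

# Assumed token pattern: a bracketed 2-character symbol such as [Si],
# a 1-character symbol, a bond symbol, a branch bracket, or a ring digit.
TOKEN = re.compile(r"\[[A-Z][a-z]\]|[A-Z]|[=#()]|[0-9]")


def find_problems(smiles):
    """Return a list of problems found in the string (empty if none)."""
    problems = []
    pos = 0
    while pos < len(smiles):
        if smiles[pos].isspace():
            problems.append(f"space at position {pos}")
            pos += 1
            continue
        match = TOKEN.match(smiles, pos)
        if match is None:
            # Typical causes: a lowercase (aromatic) atom such as "c", or a
            # 2-character symbol written without square brackets.
            problems.append(f"unexpected character {smiles[pos]!r} at position {pos}")
            pos += 1
        else:
            pos = match.end()
    return problems


for candidate in ("C1=CC=CC=C1", "c1ccccc1", "C C"):
    print(candidate, find_problems(candidate))
```

The sketch only looks at individual characters. It does not verify that brackets are balanced or that the resulting structure is chemically sensible.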
Branches are placed between a pair of round brackets ( ), and are assumed to be attached
to the atom immediately before the opening round bracket.
Numbers are used to identify the opening and closing of rings of atoms. For example, in C1CCCCC1,
the first carbon, labeled "1", is connected by a single bond to the last carbon,
also labeled "1". Polymer blocks that have multiple rings are identified by using
a different number for each ring. These numbers must be consecutive integers, starting
from 1 and increasing by 1. Currently, the bond that closes a ring must be
a single bond. Other cases will be considered later.
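The ring-numbering rule can be checked by counting the digits in the string. The sketch below is one possible check under two assumptions that the text does not state explicitly: ring labels are single digits, and each label is used for exactly one ring, so it appears exactly twice. The function name `check_ring_labels` is hypothetical.

```python
from collections import Counter


def check_ring_labels(smiles):
    """Return a list of ring-label problems (empty if none)."""
    # Assumption: every digit in the string is a single-digit ring label.
    counts = Counter(ch for ch in smiles if ch in "0123456789")
    problems = []
    for label, n in sorted(counts.items()):
        if n != 2:
            problems.append(f"ring label {label} appears {n} time(s), expected 2")
    # Labels should be 1, 2, ..., n with no gaps.
    expected = [str(i) for i in range(1, len(counts) + 1)]
    if sorted(counts) != expected:
        problems.append("ring labels are not consecutive integers starting from 1")
    return problems


for candidate in ("C1CCCCC1", "C1CCC2CCCCC2C1", "C2CCCCC2", "C1CCCCC"):
    print(candidate, check_ring_labels(candidate))
```

The check does not inspect the order of the bond that closes each ring.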
A SMILES string used for Polymer Genome represents the repeating unit of a polymer, which has 2
dangling bonds for linking with the neighboring repeating units. The repeating unit is assumed
to start at the first atom of the SMILES string and end at the last atom of the string.
Because of the periodicity, these two bonds must be of the same type. The type can be single, double, or triple,
and it must be indicated before the first atom (a single bond, as usual, needs no symbol). For the last atom, this is
not needed. As an example, CC represents -CH2-CH2- while =CC represents =CH-CH=.
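The convention for the two linking bonds can be illustrated by reading the first character of the string. The sketch below is an illustration only; the function name `split_end_bond` and the `BOND_ORDER` mapping are hypothetical and do not come from Polymer Genome.

```python
# Bond symbols defined by the rules above; no symbol means a single bond.
BOND_ORDER = {"=": 2, "#": 3}


def split_end_bond(smiles):
    """Return (order, body): the order of the linking bonds and the
    SMILES string with the leading bond symbol removed."""
    if not smiles:
        raise ValueError("empty SMILES string")
    order = BOND_ORDER.get(smiles[0], 1)
    body = smiles[1:] if order > 1 else smiles
    return order, body


# The two examples from the text: CC is -CH2-CH2-, =CC is =CH-CH=.
print(split_end_bond("CC"))
print(split_end_bond("=CC"))
```

Because the same order applies to the bond at the last atom, the returned value describes both ends of the repeating unit.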
Table 1: SMILES strings of the polymer blocks considered and some polymers constructed from them.