This article presents the prototype of a SMILES-based portfolio screening tool written entirely in Python using the libraries pandas, rdkit and tkinter. The tool supports regulatory group entry screenings of chemical substances based on their structural similarity to regulated chemical substances. Such a structure-based screening is a useful complement when the substance in question lacks a regulatory identifier (e.g. CAS or EC number) required for an identifier-based portfolio screening. As a result of the screening, the user receives a list of regulated substances ranked by Tanimoto coefficient in descending order. Based on this result, the user can further evaluate whether the substance in question is similar to, and thus in scope of, the regulated chemicals found in the screening.

Description of the GEST3 tool

The main idea is to determine the structural similarity of chemical substances, which can be calculated from their SMILES codes. One of the most popular similarity measures for comparing chemical structures represented by fingerprints is the Tanimoto[1] coefficient T. Two structures are usually considered similar if T > 0.85. However, a Tanimoto coefficient of ≥ 0.7 is also regarded as acceptable for concluding that two substances are structurally similar[2].
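To make the coefficient concrete: for two fingerprints A and B, T is the number of bits set in both divided by the number of bits set in at least one of them, i.e. T = |A ∩ B| / |A ∪ B|. The following minimal sketch computes this for fingerprints represented as Python sets of "on" bit positions. The bit positions are made up for illustration and do not come from real molecules, and the helper names `tanimoto` and `classify` are hypothetical, not part of GEST3 or rdkit.

```python
def tanimoto(bits_a: set, bits_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit positions."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / len(union)


def classify(t: float) -> str:
    """Map a coefficient onto the two thresholds mentioned in the text."""
    if t > 0.85:
        return "similar (T > 0.85)"
    if t >= 0.7:
        return "acceptable (T >= 0.7)"
    return "below threshold"


# Illustrative bit positions only, not real fingerprints
fp_a = {1, 4, 7, 9, 12, 15, 20}
fp_b = {1, 4, 7, 9, 12, 15, 33}

t = tanimoto(fp_a, fp_b)  # 6 shared bits / 8 bits in total
print(round(t, 2), classify(t))
```

Here the two sets share six of eight distinct bits, so T = 0.75, which falls between the two thresholds.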

In Python, the library rdkit[3] contains modules for drawing and reading molecular structures, calculating their fingerprints and comparing these fingerprints[4]. The default similarity metric used by rdkit is the Tanimoto similarity. I have used this rdkit functionality in the GEST3 tool as the basis for the SMILES-based screening script. The GUI of the Group Entry Screening Tool 3 (GEST3 for short) was developed entirely in tkinter[5]. With the tool, the SMILES code of a chemical substance can be entered and checked against similar molecules that are regarded as SVHC[6], restricted substances[7] or Annex XIV-listed substances[8] under the REACH Regulation. While the user sees a simple GUI and only needs to enter the SMILES code of the substance to be screened, the tool handles the similarity screening in the background, i.e. it compares the entered SMILES code with the regulated chemicals and sorts them by how closely their structures match. The source for these regulated chemicals is the set of regulatory inventories publicly available on the ECHA website, e.g. restricted chemicals under REACH Annex XVII[7].

Taking SVHC candidates, Annex XVII and Annex XIV substances together, roughly 600 to 700 structures are regulated under REACH in these lists. These REACH-regulated substances and their respective SMILES codes, as far as available, are stored in the GEST3 tool and are available for the screening.

In practice, the user enters the SMILES code of the substance in question (e.g. that of perfluorooctanoic acid) and simply clicks ‘Search’ (see Figure 1).

Figure 1: Starting GUI and entered SMILES code of perfluorooctanoic acid (PFOA)

The GEST3 tool then calculates the Tanimoto coefficient T between the entered SMILES code and the SMILES codes of the regulated substances available in the tool (see Figure 2).

As a result (see Figure 2 below), the similar substances found are displayed in descending order of the Tanimoto coefficient, ranging from T = 1.0 (100 % structural match) down to T = 0.2 (hardly any structural match, see left-hand side of Figure 2). For each substance in the resulting list, the substance name, CAS number (if available), Tanimoto coefficient, (regulatory) group, SMILES code and molecular structure are displayed (see right-hand side of Figure 2).

Figure 2: Screening result after similarity screening with SMILES code of PFOA
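The ranking shown in Figure 2 boils down to filtering and sorting (name, coefficient) pairs. Here is a minimal sketch, assuming the coefficients have already been computed. The substance names and values are invented placeholders, and `rank_hits` is a hypothetical helper, not a function of GEST3.

```python
def rank_hits(scores, cutoff=0.2):
    """Keep hits with T >= cutoff and sort them from best to worst match."""
    hits = [(name, t) for name, t in scores if t >= cutoff]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Placeholder names and coefficients, for illustration only
scores = [
    ("Substance A", 0.4),
    ("Substance B", 1.0),
    ("Substance C", 0.1),
    ("Substance D", 0.7),
]

for name, t in rank_hits(scores):
    print(f"{t:.1f}  {name}")
```

Substance C falls below the cutoff of 0.2 and is therefore not listed; the remaining hits appear with the best match first.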

Please note that the SMILES-based screening is technically only feasible for organic (non-polymeric) compounds and, to a limited extent, for inorganic substances. Reliable results can be expected mainly for organic compounds, while inorganic substances tend to yield misleading values for the Tanimoto coefficient, and polymers are not eligible for the screening because SMILES codes are lacking for macromolecular structures. In any case, it is the responsibility of the user to evaluate the findings further and to perform additional analyses on the structurally similar substances found (e.g. perfluorooctanoic acid derivatives).

The screening results can also be exported as an Excel file by clicking the ‘Create Report’ button. This functionality relies on the ExcelWriter of the pandas[9] library. Note that a report created after a screening should be saved under a different filename; otherwise, a subsequent second screening and report creation will overwrite the first report.
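One way to avoid the overwriting problem would be to put a timestamp into the report filename. The sketch below is only a suggestion and not how GEST3 currently works; the function name `report_filename` and the prefix `GEST3_report` are assumptions made for this example.

```python
from datetime import datetime


def report_filename(prefix="GEST3_report", now=None):
    """Return a filename like 'GEST3_report_20240102_030405.xlsx'."""
    now = now or datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.xlsx"


print(report_filename())
```

The returned string could then be used as the path for the Excel report, so that two screenings performed at different times end up in separate files.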

DISCLAIMER: Since regulatory inventories keep evolving and more and more substance groups are added, the set of structures available for the similarity screening in GEST3 would have to be updated frequently to reflect the latest regulatory changes. Also, since the tool relies solely on SMILES codes for the similarity screening, it cannot be used to screen for regulated substances based on their GHS hazard class, which is not a structure-related property. Nevertheless, the GEST3 tool is a simple prototype of a structure-based screening tool that can be further developed and optimized.

Some technical details and source code for GEST3 tool

First, I needed to gather the substance names and CAS numbers (as far as available) of the 600 – 700 SVHC and/or restricted substances in a list. This was mostly a manual process: downloading the relevant inventory lists from the ECHA website, combining them into one Excel file and indicating in an extra column the regulated group each substance belongs to (this last step required expert judgement and could therefore not be automated 😉 ). This Excel list was then simply saved as a csv file (named ‘Group_substances_consolidated.csv’) and was ready for the next step, retrieving the SMILES code for each substance.

Here, a laborious way would be to draw each molecule manually in a chemical drawing program (such as MarvinSketch[10]) and generate the SMILES code from the structure. Another way – the one I preferred – is to collect the SMILES code for each substance from a reliable public chemical database via its REST API.

In my case, I used the PubChem Database[11] since it contains one of the biggest datasets for any kind of chemicals accessible free of charge and has very good documentation of its API[12]. With the SMILES codes downloaded from this database, I finally had the basic structural information of the SVHC and/or restricted chemicals available in the csv file ‘Group_substances_consolidated.csv’, ready to be converted into an sdf file. This sdf file would then contain the structural information of each of the SVHC and/or restricted substances in a data format suitable for the Tanimoto screening.
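A lookup of this kind could be sketched as follows. This is a minimal illustration using only the standard library, not the script actually used for GEST3. The URL pattern follows the PubChem PUG REST scheme (compound / namespace / identifier / property / output format); the property name `CanonicalSMILES` and the helper names `smiles_url` and `fetch_smiles` are assumptions and should be checked against the current API documentation[12].

```python
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def smiles_url(identifier, namespace="name"):
    """Build the PUG REST URL asking for a compound's SMILES as plain text."""
    encoded = quote(str(identifier), safe="")
    # 'CanonicalSMILES' is the assumed property name; verify it against the API docs
    return f"{PUG_REST}/compound/{namespace}/{encoded}/property/CanonicalSMILES/TXT"


def fetch_smiles(identifier, namespace="name", timeout=30):
    """Return the first SMILES PubChem reports for the identifier, or None on failure."""
    try:
        with urlopen(smiles_url(identifier, namespace), timeout=timeout) as response:
            lines = response.read().decode("utf-8").split()
    except OSError:  # covers HTTP errors (e.g. unknown names) and network problems
        return None
    return lines[0] if lines else None


# Only the URL is built here; calling fetch_smiles() requires network access
print(smiles_url("335-67-1"))  # CAS number of PFOA, looked up as a name/synonym
```

Returning None instead of raising keeps a loop over several hundred substances running when single names cannot be resolved; such entries would then have to be completed manually.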

Here is the code that converted the SMILES codes from the csv file into sdf data:

#! python3

# - Script for converting SMILES available in a csv file into an sdf file.

'''A script using the openbabel library to convert SMILES codes
given in a simple csv file into an sdf file suitable
for the Tanimoto similarity screening. The library openbabel
has to be installed and imported first since it is not part of
rdkit or pandas.'''

import openbabel # Open Babel 2.x; with Open Babel 3.x use: from openbabel import openbabel
import pandas as pd
from os import chdir

# Define your working directory (placeholder path, adjust as needed)
chdir('C:/Path/to/your/directory')

# Load data to be converted to sdf file and store it as dictionary
df = pd.read_csv('Group_substances_consolidated.csv', sep = '|')
dict_substances = df.set_index('Substance_Name').to_dict()

# Check if dictionary is correct (optional)
# print(dict_substances.keys())

# Iterate through dictionary 'dict_substances', read and write as sdf data.
output = '' # Variable storing the newly generated sdf data per SMILES

# Loop over all SMILES codes in dict_substances
for k, v in dict_substances.items():    
    if k == 'SMILES':
        for Substance_Name, SMILES in v.items():
            s = str(SMILES) #  convert each SMILES code to string
            obConversion = openbabel.OBConversion() # instantiate obConversion
            obConversion.SetInAndOutFormats("smi", "sdf") # call openbabel method to set I/O format
            mol = openbabel.OBMol() # create variable mol as object of OBMol()
            obConversion.ReadString(mol, s) # read given SMILES string
            outMDL = obConversion.WriteString(mol) # convert SMILES string into sdf
            output = output + str(Substance_Name) # put the substance name in front of the record so it becomes its title line
            output += outMDL # append the sdf record generated for this SMILES code

# Finally write output to a simple text file (non-ASCII characters are dropped first)
output = output.encode('utf-8').decode('ascii', 'ignore')
with open('Group_substances_converted.sdf', 'w') as f:
    f.write(output)

As you can see from the script, the Openbabel module[13] was essential for converting the SMILES codes into sdf data. pandas also came in handy here for loading the csv file as a dataframe and converting it into a dictionary in which each substance is keyed by its entry in the ‘Substance_Name’ column.
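For readers wondering why the loop in the script checks for the key 'SMILES': with its default settings, to_dict() returns a nested dictionary of the form {column name: {index value: cell value}}. The hand-written dictionary below mimics that structure so the example runs without pandas. The column names 'CAS' and 'Group' as well as all values are placeholders, not the actual content of the csv file.

```python
# Hand-written stand-in for the result of df.set_index('Substance_Name').to_dict()
# Column names other than 'SMILES' and all values are placeholders.
dict_substances = {
    "CAS": {"Substance A": "000-00-0", "Substance B": "111-11-1"},
    "SMILES": {"Substance A": "CCO", "Substance B": "c1ccccc1"},
    "Group": {"Substance A": "SVHC", "Substance B": "Annex XVII"},
}

# Looking up the inner 'SMILES' dictionary directly replaces the outer loop
smiles_by_name = dict_substances["SMILES"]
for name, smiles in smiles_by_name.items():
    print(name, smiles)
```

Accessing `dict_substances["SMILES"]` directly gives the same (substance name, SMILES) pairs as looping over all columns and skipping those that are not 'SMILES'.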

Then, iterating over this dictionary with its SMILES codes cast into strings enabled the conversion into sdf data with the methods of Openbabel’s class OBConversion. The resulting output string could then be saved to an sdf file, named ‘Group_substances_converted.sdf’ in my case. This last step gave me a bit of a headache since saving without the encoding and subsequent decoding step did not work. It looks a bit cumbersome, but it worked anyway 🙂 .
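What the encode/decode step actually does is drop every character that is not plain ASCII. Presumably the original saving problem came from substance names containing such characters (for example Greek letters) that the default file encoding could not represent. The sketch below shows the effect on an example name and, as a possible alternative, writes the file with an explicit UTF-8 encoding instead. Whether downstream tools accept non-ASCII names in the sdf file would still need to be checked.

```python
import os
import tempfile

name = "α-Hexabromocyclododecane"  # example name with one non-ASCII character

# The approach from the script: bytes that are not valid ASCII are silently dropped
ascii_only = name.encode("utf-8").decode("ascii", "ignore")
print(ascii_only)

# Possible alternative: state the file encoding explicitly and keep every character
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "encoding_demo.sdf")
    with open(path, "w", encoding="utf-8") as f:
        f.write(name + "\n")
    with open(path, encoding="utf-8") as f:
        round_trip = f.read().strip()

print(round_trip == name)
```

Note that the first approach changes the substance name, which matters later because the names are used as keys when merging the screening result with the csv file.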

So now I had the right data format available for the similarity screening – yay!

The essential piece of code in this tool is the similarity check script displayed below:

#! python3
# - Tanimoto similarity screening function performing the work behind the GUI.

import pandas as pd
from rdkit import Chem
from rdkit.Chem.Fingerprints import FingerprintMols
from rdkit.Chem import DataStructs

def tanimoto_similarity(ms, mol):
    # Calculate the fingerprint of each molecule and store it by substance name
    dfps = {} # dictionary for storing fingerprints of each molecule in ms/suppl
    for (c, i) in enumerate(ms):
        if c < 1000000: # limit processing to a maximum of one million molecules
            if i is not None: # skip records rdkit could not parse
                Substance_Name = i.GetProp('_Name') # read Substance_Name from the title line of i's sdf record
                fp = FingerprintMols.FingerprintMol(i) # create fingerprint of molecule i
                dfps[Substance_Name] = fp # store fp combined with its Substance_Name in dfps

    # Execute query only if mol could be parsed from the given SMILES
    if mol is None:
        raise ValueError('bad input')
    query = FingerprintMols.FingerprintMol(mol)

    # Declare new list for storing similarities
    lst = []
    # loop through dfps to find Tanimoto similarity
    for Substance_Name, v in dfps.items():
        # tuple: (Substance_Name, similarity truncated to one decimal place)
        lst.append((Substance_Name, str(DataStructs.FingerprintSimilarity(query, v))[:3]))
    # Convert lst to df for merging & formatting.
    df = pd.DataFrame(lst, columns=['Substance_Name', 'Tanimoto_coefficient'])
    df['Tanimoto_coefficient'] = pd.to_numeric(df['Tanimoto_coefficient'], downcast="float")
    df1 = df[df.Tanimoto_coefficient >= 0.2].copy() # keep hits with T >= 0.2; the copy avoids pandas' SettingWithCopyWarning
    df1['Tanimoto_coefficient'] = df1['Tanimoto_coefficient'].astype(str)

    # Convert lst to df1 for merging with Substance_Name and SMILES per substance
    df1['Substance_Name'] = df1['Substance_Name'].astype(str)
    df2 = pd.read_csv('Group_substances_consolidated.csv', sep= '|')
    df2['Substance_Name'] = df2['Substance_Name'].astype(str)
    df3 = df1.merge(df2, on = 'Substance_Name', how = 'left')
    # Re-convert to list format.
    lst2 = df3.values.tolist()

    # Sort list using the similarities in descending order in lst2.
    lst2.sort(key=lambda x:x[1], reverse=True)

    return lst2

def run_similarity_search(SMILES, supplier_file):
    # load a mol from a SMILES string - USER INPUT, can be changed as needed
    mol = Chem.MolFromSmiles(SMILES)
    # load the supplier database used as comparison data pool of molecules
    suppl = Chem.SDMolSupplier(supplier_file)
    # Instantiate the query with given mol/call whole function
    result = tanimoto_similarity(suppl, mol)
    return result

# Dump result to text file, but limit it to the first 500 characters to get the first hits
if __name__ == '__main__':
    result = run_similarity_search('CCCCCCCCCCCC(=O)[O-].CCCCCCCCCCCC(=O)[O-].[Cd+2]', 'C:/Path/to/your/directory/Group_substances_converted.sdf')
    with open('SMILES_Tanimoto_Screen.csv', 'w') as f:
        for i, line in enumerate(str(result)):
            if i < 500: # write only the first 500 characters
                f.write(line)

The data basis for the similarity screen was the sdf file generated from the first step as explained above (i.e. the file ‘Group_substances_converted.sdf’). In the function tanimoto_similarity, the fingerprints of all these substances are calculated. This fingerprint information per substance is stored together with the substance name in a dictionary (named dfps in my code).

Next, the fingerprint of the input molecule (i.e. the SMILES code entered by the user) is calculated as well and compared with the fingerprints of all substances in the dictionary dfps. The results are stored in a list. To format this list, it was loaded into a dataframe with a column ‘Tanimoto_coefficient’ holding the calculated Tanimoto coefficient for each compared substance; hits with a coefficient below 0.2 are dropped. To get the SMILES code and the remaining information for each substance back into the table, a simple pandas merge on the ‘Substance_Name’ column of the dataframe and of the csv file ‘Group_substances_consolidated.csv’ was done. Finally, the dataframe was converted into a list and sorted by Tanimoto coefficient in descending order with the sort() method.
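A side note on the formatting: slicing the string representation of a coefficient to three characters truncates the value instead of rounding it. The sketch below illustrates the difference with made-up values and shows rounding as a possible alternative; it is not part of the GEST3 code.

```python
values = [0.8999, 0.26, 1.0, 0.7049]  # made-up coefficients

truncated = [str(v)[:3] for v in values]  # string slicing cuts off further decimals
rounded = [round(v, 1) for v in values]   # rounding to one decimal place instead

print(truncated)
print(rounded)
print(sorted(rounded, reverse=True))
```

With truncation, a coefficient of 0.8999 is reported as 0.8, whereas rounding gives 0.9. Which behaviour is preferable depends on how conservative the displayed values should be.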

The resulting list ‘lst2’ is the similarity screening result, which can then be displayed by the GUI script of the GEST3 tool. The GUI script also handles the display of the molecular structure of each similar substance found, again using rdkit methods such as Chem.MolFromSmiles(), rdDepictor.Compute2DCoords() and Chem.Draw.MolToFile(). There are many more features in the GUI script of the GEST3 tool that are not explained in detail here since I only wanted to focus on the similarity screening part and the data used for it.

Anyway, if you are interested in the full source code of the GEST3 tool, meaning the two scripts explained above as well as the GUI script and its helper functions, just leave a comment in the message box below. I would then upload the source code to either this website or Bitbucket accordingly.

Literature & resources:

[1]          Tanimoto TT (17 Nov 1958). “An Elementary Mathematical Theory of Classification and Prediction”. Internal IBM Technical Report 1957 (8?)

[2]          Bero, S. A., Muda, A. K., Choo, Y-H., Muda, N. A., Pratama, S. F., “Weighted Tanimoto Coefficient for 3D Molecule Structure Similarity Measurement”, arXiv:1806.05237, June 2018

[3]          rdkit website:

[4]          Fingerprints:

[5]          tkinter website:, GUI cover picture from, author: kalhh

[6]          Substance of Very High Concern (SVHC) inventory, ECHA official website:                   

[7]          REACH Annex XVII restricted substances, ECHA official website:

[8]          REACH authorization list, ECHA official website:

[9]          pandas website:

[10]        MarvinSketch available at    

[11]        PubChem Database:

[12]        PubChem REST API and documentation: and

[13]        Openbabel module:

