Extracting SMILES codes from open chemical databases

Today, many open scientific databases packed with valuable chemical information are readily available on the web. I recently had to collect thousands of SMILES codes for various chemical substances in a swift and pragmatic manner. I evaluated which databases would contain most of the information I was looking for and found that CACTUS^[1] and PubChem^[2] were the best fit. So I decided to access both databases via their APIs with Python, taking a list of the substances’ CAS numbers and names as input and querying each database for the respective SMILES codes. The output was a neat list with each substance’s name and/or CAS number and the SMILES code found for it. Let’s have a look at how I did this.

Starting with the input data: I had an Excel list of over 4,000 chemicals with only their chemical name and, if available, their CAS number. The list looked somewhat like the example shown in Figure 1, but much longer. First, I converted the Excel list of substances without SMILES codes into csv format (Substances_with_CAS_without_SMILES.csv) before using it for the SMILES queries in Script 1 and Script 2. Here, it was crucial to set the delimiter to the pipe symbol ‘|’ instead of a comma or semicolon, because commas and semicolons often occur in chemical names and would split a name across columns or raise exceptions when the scripts run. Furthermore, to keep track of which SMILES code belongs to which chemical name and/or CAS number, the input list should contain a Substance_Primary_Key: in its simplest form, an extra column with a unique number (1, 2, 3, 4, 5 and so on) for each substance row (see Figure 1).

*Figure 1: Snippet of an example substance csv list with CAS numbers*
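
To illustrate why the pipe delimiter matters, here is a minimal sketch using only Python’s built-in csv module. The three sample rows are made up for illustration; the column names follow those used in the scripts (Substance_Primary_Key, Substance name, CAS), and the helper names to_pipe_csv and from_pipe_csv are hypothetical.

```python
import csv
import io

# Hypothetical sample rows: (Substance_Primary_Key, Substance name, CAS)
rows = [
    (1, "Acetic acid, ethyl ester", "141-78-6"),
    (2, "1,2-Dichloroethane", "107-06-2"),
    (3, "Unnamed polymer; technical grade", ""),  # no CAS number available
]

def to_pipe_csv(rows):
    """Write rows with '|' as delimiter so commas and semicolons in names stay intact."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    writer.writerow(["Substance_Primary_Key", "Substance name", "CAS"])
    writer.writerows(rows)
    return buffer.getvalue()

def from_pipe_csv(text):
    """Read the pipe-delimited text back into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="|"))

text = to_pipe_csv(rows)
records = from_pipe_csv(text)
print(text)
```

Because neither a comma nor a semicolon is the delimiter, the names come back exactly as they went in, and a missing CAS number is simply an empty field.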

Now I had to check whether and where I could get the SMILES^[4] code for each substance in this csv file without searching for every substance manually on the web. As mentioned in the introduction, the two most suitable scientific databases for my task were CACTUS and PubChem. Both are accessible free of charge and contain data on a huge variety of chemical substances, including their SMILES codes. And the good thing is that both databases also have an API that can be accessed via scripts, e.g. with a Python script 😊 so I jumped right into coding.

The first database I accessed was PubChem, where I collected the SMILES codes via the substances’ CAS numbers (the script passes each CAS number to PubChem’s name lookup). The most important Python libraries here are urllib.request and json: the first gives access to the database’s API, the second reads the data format the API returns.

The principle of the script is this: each CAS number is first converted into a CID (see Script 1; the CID is the compound identifier used in the PubChem database), and this CID is then used in the URL that targets the desired substance and its SMILES code, if available.

Script 1: Accessing the PubChem database for SMILES extraction

#! python3 - Script1.py - Retrieve SMILES codes from PubChem API 

'''This script enables automatically connecting to the PubChem database, 
transfer of CAS numbers which are converted to CID identifiers
as first step and then resolved to respective SMILES codes.'''

# Import the libraries needed for the web service requests and data handling.
from os import chdir
import urllib.request, urllib.error
import json
import time
import pandas as pd

# Define working directory
chdir('C:/Path/to/your/working/directory')

# Function for resolving given CAS number into CID. Therefore
# variables for PUG-REST request URL pieces are defined 
def cas_to_cid(cas):
    path_prolog = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
    path_compound = '/compound/'
    path_name = 'name/'
    path_cas = cas
    path_cas_rest = '/cids/JSON'
    
    url = path_prolog + path_compound + path_name + path_cas + path_cas_rest
    # Make a PUG-REST request and store the output in "request"
    print('cas_to_cid:', url)
    try:
        request = urllib.request.urlopen(url)
    except urllib.error.URLError as err: # also covers HTTPError, a subclass of URLError
        print('Request failed for cas', cas, '-', err)
        return ''
    
    # Give the output/reply back as JSON and return CID number from function
    if request is not None:
        reply = request.read()
        if reply is not None and len(reply) > 0:
            json_out = json.loads(reply)
            cid = json_out['IdentifierList']['CID'][0]
            return cid
    return ''

# Function for searching and extracting SMILES code with entering CID 
def cid_to_smiles(cid):
    path_prolog = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug'
    path_compound = '/compound/'
    path_name = 'cid/'
    path_cid = str(cid)
    path_cid_rest = '/property/CanonicalSMILES/JSON'
    
    url = path_prolog + path_compound + path_name + path_cid + path_cid_rest
        
    # Make a PUG-REST request and store the output in "request"
    print('cid_to_smiles:', url)
    try:
        request = urllib.request.urlopen(url)
    except urllib.error.URLError as err: # also covers HTTPError, a subclass of URLError
        print('Request failed for cid', cid, '-', err)
        return ''
    
    # Give the reply in JSON format, access and return the SMILES code
    if request is not None:
        reply = request.read()
        if reply is not None and len(reply) > 0:
            json_out = json.loads(reply)
            #return json_out
            smiles = json_out['PropertyTable']['Properties'][0]['CanonicalSMILES']
            return smiles
    return ''

# NOTE: to do this for many CAS numbers, iterate thru the given list and call above functions to 
# resolve to cid and, in turn, to SMILES. Sleep between each request to avoid overloading
# PubChem servers.

# Load list with CAS numbers where SMILES code is to be requested
df = pd.read_csv('Substances_with_CAS_without_SMILES.csv', sep = '|')
list_cas = df['CAS'].dropna().astype(str).tolist() # skip rows without a CAS number

# Both functions described above are now called in the third function map_cas_list_to_csv
def map_cas_list_to_csv(list_cas):
    output = ''
    for cas in list_cas:
        cid = cas_to_cid(cas)
        if len(str(cid)) > 0:
            smiles = cid_to_smiles(cid)
            if len(smiles) > 0:
                line = cas + '|' + str(cid) + '|' + smiles
                output = output + line + '\n' # create and concatenate output
                print(line)
        time.sleep(0.8) # sleep 0.8 seconds after each loop pass, whether or not a SMILES code was found
    return output

s_out = 'CAS|CID|SMILES\n'
output = map_cas_list_to_csv(list_cas) # call function for generating final result
final = s_out + output # now final contains a complete csv as string, just write it out to a file.

with open("RESULT_Substances_with_CAS_with_SMILES.csv", "w") as file:
    file.write(final)

The input csv file is read into a pandas DataFrame (see Script 1), and all values in the column ‘CAS’ are collected in the list variable list_cas, which is passed to the function map_cas_list_to_csv(list_cas). This function iterates over the CAS numbers and looks up the CID for each one, if available in the PubChem database. The helper function cas_to_cid, as defined in Script 1, does this lookup. Only if a CID is found and leads to a SMILES code is an output line (CAS number, CID and SMILES code, separated by ‘|’) added to the result. Since PubChem returns its replies in JSON format, the json library is imported and used in cas_to_cid and cid_to_smiles to extract the CID and the SMILES code per substance.
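
The JSON handling described above can be sketched without any network access. The sample replies below are illustrative only: they are shaped after the keys Script 1 reads (IdentifierList/CID and PropertyTable/Properties), and the helper names first_cid and first_smiles are hypothetical. Unlike Script 1, this variant returns None when an expected key is missing instead of raising a KeyError.

```python
import json

# Illustrative replies, shaped after the keys that Script 1 reads.
cid_reply = '{"IdentifierList": {"CID": [2244]}}'
smiles_reply = (
    '{"PropertyTable": {"Properties": '
    '[{"CID": 2244, "CanonicalSMILES": "CC(=O)OC1=CC=CC=C1C(=O)O"}]}}'
)
empty_reply = '{"IdentifierList": {}}'

def first_cid(reply_text):
    """Return the first CID in a reply, or None if the expected keys are missing."""
    data = json.loads(reply_text)
    cids = data.get("IdentifierList", {}).get("CID", [])
    return cids[0] if cids else None

def first_smiles(reply_text, key="CanonicalSMILES"):
    """Return the SMILES string of the first property entry, or None if absent."""
    data = json.loads(reply_text)
    properties = data.get("PropertyTable", {}).get("Properties", [])
    if not properties:
        return None
    return properties[0].get(key)

print(first_cid(cid_reply))
print(first_smiles(smiles_reply))
print(first_cid(empty_reply))
```

Keeping the property name as a parameter (key) makes it easy to adapt the helper if the reply uses a different field name than the one assumed here.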

Next, in the cid_to_smiles function, each collected CID is used to request the SMILES code via the substance’s individual URL. Please note that each request consumes resources on the database side, so too many requests in a short period of time can lead to the database denying access. Therefore, the time.sleep() function inserts a short pause (here, 0.8 seconds) between requests to avoid overloading the database. I recommend not using lists longer than 500 or 600 CAS numbers in one run, to avoid access denial and the script stopping without generating the desired output ☹.
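
One way to follow both recommendations (a pause after every request and runs of at most about 500 CAS numbers) is sketched below. The helper names chunked and throttled_map are hypothetical, the batch size of 500 is taken from the recommendation above, and the identifiers in the demo are placeholders rather than real CAS numbers.

```python
import time

def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def throttled_map(lookup, items, pause=0.8, sleep=time.sleep):
    """Call lookup(item) for each item and pause after every call, hit or miss."""
    results = {}
    for item in items:
        results[item] = lookup(item)
        sleep(pause)
    return results

# Placeholder identifiers, not real CAS numbers.
cas_numbers = [f"cas-{i}" for i in range(1200)]
batches = list(chunked(cas_numbers, 500))
print([len(batch) for batch in batches])  # [500, 500, 200]
```

Each batch could then be passed to throttled_map together with a lookup function such as cas_to_cid, and the results written out per batch, so that a denied request costs at most one batch of work. The sleep parameter exists only so the pause can be replaced in tests.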

Finally, the CID and SMILES code collected per CAS number are written to a csv file that can be opened and saved as an Excel file as needed (see Figure 2):

*Figure 2: Resulting csv file with desired SMILES Codes and CID number*

Note that the Substance Primary Key does not appear in the resulting list, since it was not used in, and not relevant to, the CAS- and CID-based requests in my case. Script 2 does make use of the Substance Primary Key. Also, no SMILES code was found for CAS 107215-67-8, so this substance does not appear in the list in Figure 2.

My second script, for accessing the CACTUS database, follows a similar approach, but this time it was possible to query the database not only via a substance’s CAS number but also via its chemical name. Here it goes:

Script 2: Accessing the CACTUS database for SMILES extraction

#! python3 - Script2.py - Retrieve SMILES codes from NCI/CACTUS database

'''This script takes an input of a list with chemical substances and transfers their
CAS numbers to the API of the nci.nih.gov chemical database for retrieving SMILES codes
for each CAS and substance as far as available in the database.'''

from os import chdir
import pandas as pd
from urllib.parse import quote
from urllib.request import urlopen

# Define working directory
chdir('C:/Path/to/your/working/directory')

# Connection to API function
def CIRconvert(ID):
    try:
        url = 'http://cactus.nci.nih.gov/chemical/structure/' + quote(ID) + '/smiles' # quote important!
        print('resolving', url)
        ans = urlopen(url).read().decode('utf8')
        return ans
    except Exception: # e.g. HTTP error for an unknown ID, or a non-string ID such as NaN
        return 'Did not work'

# Load dataframe with substances and their CAS numbers and convert to a dictionary
# Substance_Primary_Key could be a column with an unique identifier number in the list.
df = pd.read_csv('Substances_with_CAS_without_SMILES.csv', sep = '|')
dicti = df.set_index('Substance_Primary_Key').to_dict()

# Create two empty dictionaries for storing newly generated infos obtained from for-loops below
dSMILES = {}
dictTemp = {}

# Iterate thru dicti and map CAS to SMILES code via API
for k, v in dicti.items():
    if k == 'CAS':
        for S_Prim_Key, CAS in v.items():
            s = CIRconvert(CAS) # map via external API
            if s != 'Did not work':
                dSMILES[S_Prim_Key] = s # add SMILES code for this CAS number to dSMILES dictionary
            else: # if mapping didn't work, put the substance's name in dictTemp
                  # for later mapping via the chemical name
                dictTemp[S_Prim_Key] = dicti['Substance name'][S_Prim_Key]

# If CAS numbers could not be mapped to a SMILES code, try mapping a Substance_Primary_Key's Chemical  
# name to a SMILES code
if dictTemp: # if dictTemp isn't empty
    print('Could not map some CAS numbers, trying via Chemical name...')
    for S_Prim_Key, Chem in dictTemp.items():
        s = CIRconvert(Chem)
        if s != 'Did not work':
            dSMILES[S_Prim_Key] = s
        else:
            print('No SMILES code found for Substance_Primary_Key', S_Prim_Key)
                
# Add new subdict 'SMILES' to our dicti
dicti['SMILES'] = dSMILES

# ... and convert back to a DataFrame.
df2 = pd.DataFrame.from_dict(dicti)

# dump to excel file
df2.to_excel('Substances_with_SMILES_from_Cactus-DB.xlsx') # requires an Excel engine such as openpyxl

As you can see, Script 2 uses similar libraries to Script 1 (pandas and urllib) and has a similar structure. The main difference is that converting the CAS number into a CID, the identifier specific to the PubChem database, is not required. Instead, the function CIRconvert(ID), which makes the API call for each substance, first takes the CAS number to retrieve the SMILES code. If a substance has no CAS number, or its CAS number cannot be resolved, the query is repeated with the substance name (variable Chem in the script) as identifier. The CAS number and chemical name of each substance are linked via its Substance Primary Key, which serves as the dictionary key. The Substance Primary Key therefore appears in the resulting Excel list together with the substance name, the CAS number (if available) and the SMILES code found. Please also check the comments in Script 2 for further details. When the script runs smoothly, the resulting list looks like the one shown in Figure 3.

*Figure 3: Resulting list with SMILES codes found*
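
The lookup order used in Script 2 (CAS number first, chemical name second) can also be sketched as a small function that is testable without a network connection. The names build_url and resolve_smiles are hypothetical, the in-memory table fake_db stands in for the CACTUS call, and the sketch assumes that missing CAS numbers arrive as empty strings or None rather than as NaN.

```python
from urllib.parse import quote

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure/"  # same endpoint as in Script 2

def build_url(identifier):
    """Percent-encode the identifier so that spaces in names survive in the URL path."""
    return BASE_URL + quote(identifier) + "/smiles"

def resolve_smiles(cas, name, lookup):
    """Try the CAS number first, then the name; return (smiles, source) or (None, None)."""
    for source, identifier in (("CAS", cas), ("name", name)):
        if not identifier:
            continue  # nothing to look up for this identifier type
        smiles = lookup(identifier)
        if smiles:
            return smiles, source
    return None, None

# Stand-in for the network call: a hypothetical in-memory table.
fake_db = {"64-17-5": "CCO", "acetic acid": "CC(=O)O"}

print(build_url("acetic acid"))
print(resolve_smiles("64-17-5", "ethanol", fake_db.get))
print(resolve_smiles("", "acetic acid", fake_db.get))
```

Returning the source alongside the SMILES code records whether a hit came from the CAS number or from the name, which can help when checking the results later.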

As you can see from the result in Figure 3, one advantage over the output shown in Figure 2 is that the primary key is also written to the Excel file (it could also be written to a csv file, if needed). However, two substances could not be found in the CACTUS database.

This taught me two things. First, it is sometimes worth querying more than one database for the same information and combining the results. Second, if possible, a unique identifier should be kept before and after such a query, like the Substance Primary Key in this example. This makes it easy to verify which input rows came back with a result and whether something is missing.
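
Both lessons can be combined in a few lines. The sketch below assumes that the results of each database are available as a dictionary keyed by the Substance Primary Key (for Script 1 this would mean carrying the key through the query, which the script above does not do); the function name combine_results and the sample values are hypothetical.

```python
def combine_results(input_keys, *sources):
    """Take the first non-empty SMILES per key across sources; list keys still missing."""
    combined, missing = {}, []
    for key in input_keys:
        for source in sources:
            smiles = source.get(key)
            if smiles:
                combined[key] = smiles
                break
        else:
            # No source had an entry for this key.
            missing.append(key)
    return combined, missing

# Hypothetical results keyed by Substance_Primary_Key.
pubchem_hits = {1: "CCO", 3: "CC(=O)O"}
cactus_hits = {1: "CCO", 2: "C1=CC=CC=C1"}

combined, missing = combine_results([1, 2, 3, 4], pubchem_hits, cactus_hits)
print(combined)
print(missing)
```

The order of the sources decides which database wins when both return a SMILES code, and the list of missing keys shows at a glance which input rows still need a manual search.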

One important note to finish: for polymers and some naturally occurring substances, no SMILES code is available, so neither Script 1 nor Script 2 returns a result for them. Furthermore, in rare cases where a substance is brand new and its CAS number was assigned only very recently, it is not yet recorded in the PubChem database and thus cannot be found there.

That’s all for today. I hope you enjoyed this article and will try this yourself. Let me know in the comments if you have questions or suggestions. Thanks for reading!

Literature & Links:

[1] CACTUS = CADD Group Chemoinformatics Tools and User Services, Link to CACTUS database: http://cactus.nci.nih.gov

[2] Link to PubChem database’s API: https://pubchem.ncbi.nlm.nih.gov/rest/pug

[3] Picture by Gerd Altmann from https://pixabay.com

[4] SMILES = Simplified Molecular Input Line Entry Specification: https://www.daylight.com/dayhtml/doc/theory/theory.smiles.html

Published by Angelika Keller on August 25, 2021