Retrieve information from your IUCLID 6 database using Python

IUCLID 6^[1] is the standard database for preparing and submitting REACH dossiers to ECHA. Each chemical company using it therefore holds hundreds or even thousands of substance files, which can be accessed batch-wise via the IUCLID 6 REST API^[2]. Here, I present a short Python script as a practical example for pulling the ‘Estimated quantities’ information out of any substance file in an IUCLID 6 database. The prerequisite is that the user has access to an IUCLID 6 database filled with their own company’s substance files, along with the respective login credentials. The REST API itself should already be available if the user has installed the IUCLID 6 database on a local server. On the Python side, only the standard library modules os, re and json plus the third-party package requests are required.

If you work in the chemical regulatory business and are responsible for preparing and submitting REACH dossiers to ECHA, you are surely familiar with the IUCLID 6 database.

This is the standard database for preparing the substance and dossier files of all REACH registrations submitted via REACH-IT. As more and more dossiers are prepared in this database, your IUCLID 6 system fills up with hundreds or even thousands of datasets of your company’s substances, stored as i6z files.

What if you would like to pull out specific information per substance file without manually opening and exporting data from each individual file?

For this purpose, the developers of the IUCLID 6 database built a REST API into the system and documented it. Unfortunately, although the documentation and tutorial explain the API itself well, there are very few practical examples of how to retrieve data from substance files^[3]. I therefore set out to develop and publish a practical example of how to retrieve data from IUCLID 6 via its API in Python.

The task in my case was to retrieve quantity data from the IUCLID chapter ‘3.2 Estimated quantities’ for anywhere from three to several dozen IUCLID 6 substance files. Here is the full Python script:

#! python3 - getQuantitiesfromIUCLID6.py
# Connect to the IUCLID6 REST API, collect all quantity information of specific substance files
# and write it to txt files.

import os
import re
import json
import requests
from requests.auth import HTTPBasicAuth

# Specify your working directory.
os.chdir('C:/Path/to/your/working/directory')

# Store the login credentials separately from script for security reasons.
with open('creds.txt') as creds:
    credentials = creds.readlines()
    user = credentials[0].strip()
    password = credentials[1].strip()
    
# API address basis for all requests.
s_REST_API = 'https://your-IUCLID6-url/iuclid6-ext/api/ext/v1' 

# Regex pattern1 extracts the substance UUID from a URI string; pattern2 extracts the UUIDs of the
# 'Estimated quantities' entries. Raw strings keep the backslashes from being read as string escapes.
pattern1 = r'([A-Z0-9\-]+)?[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}'
pattern2 = r'EstimatedQuantities\/([A-Z0-9\-]+)?[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}'

# List of substance file names available for quantity collection from our IUCLID6 system.
Substances = ['Your_Substance_1', 'Your_Substance_2', 'Your_Substance_3'] 

def get_quantities_by_substance(substances):
    '''Get all available estimated quantity data per substance file in our IUCLID6 system.'''
    for substance in substances:
        found_substance_files = []
        s_quantities = []
        
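        # First, find the substance files by substance name in our IUCLID6 database and catch their UUIDs.
        # Passing the name via params lets requests percent-encode spaces and special characters.
        # Note: verify = False switches off TLS certificate verification.
        response1 = requests.get(s_REST_API + '/query/iuclid6/bySubstance',
                    params = {'doc.type': 'SUBSTANCE', 'sub.chemical': substance},
                    auth = HTTPBasicAuth(user, password),
                    headers = {'accept':'application/vnd.iuclid6.ext+json;type=iuclid6.Document'},
                    verify = False)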
        # Parse the JSON response and count the substance files found.
        s_reply1 = json.loads(response1.text)
        res_len_1 = len(s_reply1['results'])
        print('### size found substances:', res_len_1) # just a response check whether any substance is found at all.
        found_substance_files.extend(s_reply1['results'])
        s_UUID_s_multi = []
        for found_substance in found_substance_files:
            s_substance_UUID = re.search(pattern1, found_substance['uri'])
            if s_substance_UUID:
                s_UUID_s_multi.append(s_substance_UUID.group())
        if not s_UUID_s_multi:
            print('No substance file for ' + substance + ' and respective UUID found.')
            continue
        print(s_UUID_s_multi)

        # Second, get endpoint 'Estimated quantities'; the request gives back all available entries for
        # estimated quantities per substance, each one with its own UUID.
        for s_UUID_s in s_UUID_s_multi:
            response2 = requests.get(s_REST_API + '/raw/SUBSTANCE/' + s_UUID_s +
                        '/document/FLEXIBLE_RECORD.EstimatedQuantities',
                        auth = HTTPBasicAuth(user, password),
                        headers = {'accept':'application/vnd.iuclid6.ext+json; type=iuclid6.Document'},
                        verify = False)
            # Collect all found entries of estimated quantities in a list.
            s_reply2 = json.loads(response2.text)
            res_len_2 = len(s_reply2['results'])
            # Just a response check whether the substance file has quantity information.
            print('### size found quantity entries:', res_len_2, 'for substance file ', s_UUID_s)
            l_UUID_quants = []
            for i in range(0, res_len_2):
                s_quant_UUID = re.search(pattern2, s_reply2['results'][i]['uri'])
                if s_quant_UUID is None:
                    continue  # URI does not match pattern2; skip this entry.
                s_UUID_quant = s_quant_UUID.group().replace('EstimatedQuantities/', '')
                l_UUID_quants.append(s_UUID_quant)

            # Third, for each identified substance file and its associated quantity entries, make the response3
            # request to get the content per quantity entry and store this information in a list.
            # Reset the list per substance file so that each output file holds only its own entries.
            s_quantities = []
            for UUID in l_UUID_quants:
                response3 = requests.get(s_REST_API + '/raw/SUBSTANCE/' + s_UUID_s +
                            '/document/FLEXIBLE_RECORD.EstimatedQuantities/' + UUID,
                            auth = HTTPBasicAuth(user, password),
                            headers = {'accept':'application/vnd.iuclid6.ext+json; type=iuclid6.Document'},
                            verify = False)
                s_reply3 = json.loads(response3.text)
                s_quantities.append(s_reply3)
            
            # Finally, dump all quantity information related to each substance file in a readable text file.
            json_obj = json.dumps(s_quantities, indent = 4)
            filename = ('REACH_quantities_' + substance.replace('|', '-').replace('/', '-') +
                        ' Substance_UUID ' + s_UUID_s + '.txt')
            with open(filename, 'w') as f:
                f.write("\nREACH Quantities of " + substance + ' Substance_UUID ' + s_UUID_s + "\n\n")
                f.write(json_obj)

if __name__ == '__main__':
    get_quantities_by_substance(Substances)

First, the required modules are imported: os for setting the working directory (optional), requests for accessing the API, json for parsing the data returned by the endpoints, and the regular expression module (re) for extracting the UUIDs of both the substance file and the endpoint. Note that requests is a third-party package and has to be installed separately, e.g. via pip. As mentioned earlier, the precondition is that you have IUCLID 6 installed locally or on a local server and that you have login credentials for your IUCLID 6 system. The URL of your IUCLID 6 system is known only to you and serves as the base string for the variable s_REST_API.

To avoid hardcoding my login credentials in the script, I stored them separately in a text file (creds.txt in the code). The script opens this file, reads the username and password, and loads them into the variables ‘user’ and ‘password’, respectively.
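A slightly more defensive way of loading the credentials might look like the sketch below. The helper name read_credentials and its error message are my own illustration and not part of the script above; it assumes, as in the script, that the file holds the username on the first line and the password on the second.

```python
from pathlib import Path


def read_credentials(path):
    """Return (user, password) from a two-line text file.

    Hypothetical helper: assumes line 1 is the username and line 2 the password.
    """
    lines = Path(path).read_text(encoding='utf-8').splitlines()
    # Ignore blank lines so that a trailing newline does not count as an entry.
    entries = [line.strip() for line in lines if line.strip()]
    if len(entries) < 2:
        raise ValueError('Expected username and password on two separate lines.')
    return entries[0], entries[1]
```

Failing early with a clear message is easier to debug than an IndexError or a rejected login further down in the script.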

Next, I used the re module to set up two patterns matching the UUIDs of the substance file itself (pattern1) and of each quantity entry in IUCLID 6 chapter 3.2 Estimated quantities (pattern2). These patterns should work for any UUID in an IUCLID 6 system, but I can only confirm this for the system I used. When using the script on your own system, you may have to check whether the search patterns work or need slight modification to fit your case.
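One way to check the patterns before running the whole script is to try them on single URI strings. In the sketch below, the two URIs and their UUIDs are made up for illustration, and the helper extract_uuid is hypothetical; the URIs returned by your system may be laid out differently.

```python
import re

# Same patterns as in the script, written as raw strings.
pattern1 = (r'([A-Z0-9\-]+)?[0-9a-fA-F]{8}\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{4}'
            r'\-[0-9a-fA-F]{4}\-[0-9a-fA-F]{12}')
pattern2 = r'EstimatedQuantities\/' + pattern1

# Made-up example URIs (assumption: not copied from a real IUCLID 6 response).
substance_uri = 'iuclid6:/SUBSTANCE/3f2b8c1e-9d47-4a6b-8e21-5c7d9f0a1b2c'
quantity_uri = 'FLEXIBLE_RECORD.EstimatedQuantities/0a1b2c3d-4e5f-4a6b-8c7d-9e0f1a2b3c4d'


def extract_uuid(pattern, text, prefix=''):
    """Return the matched UUID without the given prefix, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group().replace(prefix, '')


substance_uuid = extract_uuid(pattern1, substance_uri)
quantity_uuid = extract_uuid(pattern2, quantity_uri, prefix='EstimatedQuantities/')
print(substance_uuid)
print(quantity_uuid)
```

If either call returns None for a URI taken from your own system, the pattern needs adjusting before the script can collect anything.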

Then I had to specify in a list (Substances) all the substances whose quantity information I wanted to retrieve. It is mandatory to enter the exact substance file name: if the substance file is simply named ‘formaldehyde’, insert it in the list like that, and if the name is ‘formaldehyde / CAS 50-00-0’, insert exactly this name. Otherwise the substance file is not found.
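Because the file name ends up in the query string of the first request, spaces and slashes in it have to be percent-encoded. The requests library takes care of this when the values are handed over through its params argument; the sketch below only illustrates the resulting query string with the standard library. The query keys are the ones used in the script, while the helper name build_query is hypothetical.

```python
from urllib.parse import quote, urlencode


def build_query(substance_name):
    """Build the query string for the substance search (hypothetical helper)."""
    params = {'doc.type': 'SUBSTANCE', 'sub.chemical': substance_name}
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return urlencode(params, quote_via=quote)


print(build_query('formaldehyde / CAS 50-00-0'))
```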

Now we come to the main function of the script, get_quantities_by_substance(). Here, we iterate over the list Substances and make three kinds of API request for each substance. The first API request checks whether the searched substance is available in the IUCLID 6 database at all, or whether even multiple files exist, and stores the hits in a list (found_substance_files). Then the UUID of each substance file found is extracted by matching with regex pattern1. These UUIDs are stored in a list as well (s_UUID_s_multi), since this list is the basis for the second API request: looking up the endpoint ‘Estimated quantities’ per substance file. If quantity entries are found for a substance file, the UUID of each entry is extracted by matching with regex pattern2 and appended to the list l_UUID_quants.
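The sequence of requests can also be sketched independently of the network, which makes it easy to test the logic with canned responses. In the sketch below, collect_quantities and the injected callable fetch_json(path, params) are hypothetical names of my own, and the two regexes are simplified stand-ins for pattern1 and pattern2 (without the optional prefix group); the endpoint paths are the ones from the script.

```python
import re

# Simplified stand-ins for pattern1 and pattern2 (assumption: plain UUIDs only).
UUID = r'[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}'
SUBSTANCE_RE = re.compile(UUID)
QUANTITY_RE = re.compile(r'EstimatedQuantities/(' + UUID + ')')


def collect_quantities(fetch_json, substance_name):
    """Run the three lookups for one substance name.

    fetch_json(path, params) is a hypothetical callable that performs the
    GET request and returns the decoded JSON.
    Returns a dict {substance_uuid: [quantity_document, ...]}.
    """
    # Request 1: find all substance files with this name.
    found = fetch_json('/query/iuclid6/bySubstance',
                       {'doc.type': 'SUBSTANCE', 'sub.chemical': substance_name})
    quantities = {}
    for hit in found.get('results', []):
        match = SUBSTANCE_RE.search(hit['uri'])
        if match is None:
            continue
        substance_uuid = match.group()
        base = ('/raw/SUBSTANCE/' + substance_uuid +
                '/document/FLEXIBLE_RECORD.EstimatedQuantities')
        # Request 2: list the quantity entries of this substance file.
        entries = fetch_json(base, None)
        documents = []
        for entry in entries.get('results', []):
            entry_match = QUANTITY_RE.search(entry['uri'])
            if entry_match is None:
                continue
            # Request 3: fetch the content of one quantity entry.
            documents.append(fetch_json(base + '/' + entry_match.group(1), None))
        quantities[substance_uuid] = documents
    return quantities
```

With requests, fetch_json could be a thin wrapper that calls requests.get with the base URL, credentials and accept header from the script and returns response.json().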

With these two lists at hand, one storing the UUID(s) of the found substance file(s) (s_UUID_s_multi) and the other storing the UUIDs of all quantity entries found per substance file (l_UUID_quants), the final API call retrieves each quantity entry as JSON data. This JSON data is then written to a separate text file per substance file, containing the quantity information available in that file.
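The writing step could be wrapped in a small function like the one below. This is only a sketch: the name write_quantities and the out_dir parameter are hypothetical, and, unlike the script, it replaces every character that Windows does not allow in file names rather than only ‘|’ and ‘/’.

```python
import json
import re
from pathlib import Path


def write_quantities(out_dir, substance, substance_uuid, quantities):
    """Write the quantity entries of one substance file to a text file (hypothetical helper)."""
    # Replace characters that are not allowed in Windows file names.
    safe_name = re.sub(r'[\\/:*?"<>|]', '-', substance)
    path = Path(out_dir) / f'REACH_quantities_{safe_name} Substance_UUID {substance_uuid}.txt'
    with open(path, 'w', encoding='utf-8') as f:
        f.write(f'\nREACH Quantities of {substance} Substance_UUID {substance_uuid}\n\n')
        json.dump(quantities, f, indent=4)
    return path
```

Returning the path makes it easy to log which files were created during a batch run.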

And that’s it! Python doesn’t care whether you retrieve the quantity information of only 3 substance files or of 30 or even more: you get the information within a very short time in handy text files, without opening IUCLID 6 itself and extracting all the information manually.

Any questions or comments on this example? Just leave me a note in the comment box below.

Literature & resources:

[1] IUCLID6 official website and download: https://iuclid6.echa.europa.eu/de/

[2] IUCLID6 Public REST API: https://iuclid6.echa.europa.eu/de/public-api

[3] IUCLID6 Public REST API request example: https://iuclid6.echa.europa.eu/documents/21812392/23181267/IUCLID_public_rest_api_eg_rsr.pdf/96d8f1fa-459b-1a3d-00c3-35da3a8991c0

[4] ‘Retrieving labrador’ picture source: Jana Schmidt, www.pixabay.com

Published by Angelika Keller on May 26, 2021
