ABSTRACT

In the regulatory field, new hazard classes are emerging, notably persistent, mobile, and toxic (PMT) substances, which are of increasing concern due to their potential risks to aquatic environments. PMT substances are regarded as of similar concern to persistent, bioaccumulative, and toxic (PBT) substances. According to UBA guidance, a substance is considered PMT if it meets specific criteria for persistence, mobility, and toxicity. My new small Machine Learning (ML) project therefore aimed at a rough prediction of whether a substance is PMT, PBT, or neither of these two categories. The main steps in this project were the following:

Data Collection: I started the project by gathering selected physico-chemical and (eco)toxicological data, i.e. feature data, for over 650 substances from public databases such as ECHA CHEM and PubChem, compiling the information into a CSV file.

Manual Evaluation: I then manually classified each substance into one of three categories: 0 (neither PMT nor PBT), 1 (PMT), or 2 (PBT), based on the established criteria. For the PBT classification, I mainly used the PBT assessment list published by ECHA.

Model Development: Using a Jupyter notebook, I analyzed the substances’ dataset and prepared it for modeling. Non-numeric data was converted to numeric values, and the dataset was cleaned to ensure it contained no null values. The data was then split into training (80 %) and test (20 %) sets, with standardization applied to maintain consistent scales.

The ML model was created, fitted, and trained using Logistic Regression, a suitable algorithm for classification tasks. The model achieved a high initial accuracy that could not be further improved through cross-validation. Predictions were then made on the test dataset, and the model’s performance was evaluated using a confusion matrix and a classification report, which provided metrics such as the accuracy, precision, recall, and F1 scores of the implemented ML model.

CONCLUSION

The Logistic Regression model effectively classifies substances as PMT, PBT, or neither of these two categories based on the substances’ feature data. However, distinguishing between “very persistent and very bioaccumulative” (vPvB) and PBT, as well as between “very persistent and very mobile” (vPvM) and PMT, was not possible due to insufficient data for each substance in the dataset, especially (long-term) degradation data in water, soil, air etc. Thus, model refinement based on more data and fine-tuning has not been done yet, but this project can serve as a starting point for further ML projects in the regulatory field.


Under the chemical[1] and classification[2] regulations in the EU, you may have noticed that there are several new hazard classes that are of growing interest to the authorities[3]. One of these new categories is persistent, mobile and toxic (PMT) substances[4], which are regarded as of similar concern to persistent, bioaccumulative and toxic (PBT) substances[5], especially in terms of posing a risk to the aquatic environment. According to the UBA guidance documents, PMT substances have the following properties:

Persistence (P) is defined by various degradation half-lives in different environments, including marine and freshwater, sediments, and soil. For example, a substance is persistent if its degradation half-life in marine water at 9 °C exceeds 60 days.

Mobility (M) is determined by the organic carbon-water partition coefficient (log Koc), where a value below 4.0 indicates high mobility.

Toxicity (T) is assessed through criteria that include long-term no-observed effect concentrations (NOEC) below 0.01 mg/L, classification as carcinogenic or mutagenic, and evidence of chronic toxicity.

Given these criteria, I started a Machine Learning (ML) project aimed at predicting whether substances fall into the PMT category, the PBT category, or neither, based on their intrinsic data features, such as toxicity classifications like carcinogenic, mutagenic or toxic for reproduction (CMR) or specific target organ toxicity (STOT), half maximal effective concentration (EC50) values, biodegradability, and the bioconcentration factor (BCF).

The idea was to find and train an ML model that classifies substances as PMT (category 1), PBT (category 2), or neither of these two categories (category 0) based on a few selected values (“features”) per substance, which were the following:

  • STOT or Cramer Classification
  • CMR
  • Biodegradability
  • Log Koc
  • Log Pow
  • BCF
  • EC50

Therefore, the data had to be collected first; then, an algorithm suitable for this classification problem had to be trained on the preprocessed data so that it could predict PMT and/or PBT properties for substances in the test data with high accuracy, precision and recall. In my opinion, a suitable Python platform for this project is the Jupyter notebook[6], so I set up the code there.

Now, let us go through the steps I have taken to tackle this goal:

Step 1: Collect all data for a set of over 650 substances, which serve as the features for the prediction

For a large set of substances, the feature data mentioned above is publicly available in the ECHA CHEM database[7] and the PubChem database[8]. For the latter in particular, data like the log Pow can be retrieved via API request. In some rare cases where neither database contained a specific feature, e.g. the BCF of a substance, it was calculated via EPI Suite[9]. Regarding the EC50 values, a value was not found for every substance in the dataset, so this had to be taken care of later.
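As an illustration of such an API request, here is a small sketch using PubChem’s PUG REST interface. The endpoint path follows the public PUG REST documentation; the helper names, the example substance, and the sample payload are my own assumptions, not taken from the original notebook:

```python
# Sketch: retrieving the log Pow (XLogP) of a compound from PubChem
# via the PUG REST API (helper names are illustrative assumptions).
import json
from urllib.parse import quote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def xlogp_url(name: str) -> str:
    """Build the PUG REST URL requesting the XLogP property by compound name."""
    return f"{BASE}/compound/name/{quote(name)}/property/XLogP/JSON"

def parse_xlogp(payload: dict) -> float:
    """Extract XLogP from a PUG REST property-table JSON response."""
    return payload["PropertyTable"]["Properties"][0]["XLogP"]

# Example: build the request URL for atrazine (an arbitrary example compound).
print(xlogp_url("atrazine"))
```

The returned JSON can then be fetched with any HTTP client (e.g. requests) and passed to `parse_xlogp`.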

Compiling the data of 657 substances in an Excel file and saving it as CSV was then the basis for step 2.

Step 2: Manually evaluate the substances and classify them

Based on the feature data per substance, I assigned each substance a category, aka a target number: 0 (neither PMT nor PBT), 1 (PMT) or 2 (PBT). My decision as to which substances are PMT and which are PBT was based on the criteria laid down in the available guidance documents[4,5]. For the classification as a PBT substance, I also used the PBT assessment list[10] published on the ECHA website: I included in my dataset nearly a dozen substances from this list that were classified as PBT by the authorities. The category, aka target number, per substance was recorded in an extra column of the table. Now the dataset containing features and targets per substance was ready to be processed in the Jupyter notebook.

Step 3: Predict and classify substances with Logistic Regression

At first, I loaded the dataset in the Jupyter notebook (see Figure 1), then analyzed and prepared the data for the algorithm. I will not show all the detailed steps here (especially not the visualization part), but you can find the full code of this project on Kaggle[11].

Figure 1: Loading and display of the raw dataset in Jupyter notebook

The key modules I used throughout the project are matplotlib[12], pandas[13], seaborn[14], numpy[15], and of course scikit-learn, aka sklearn[16]. After loading the data, one should check that everything was imported correctly by displaying the head of the raw dataset. Then we can explore the data with the describe() method and the dtypes attribute. We see in Figure 2 that the raw dataset contains two different data types, object and float64.

Figure 2: Explore the raw data

By calling pandas’ isnull().sum() method, we also see in Figure 3 that the column “EC50” contains 512 null values, which we have to address in the data preprocessing.

Figure 3: Check the dataset for null values
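The loading and inspection steps of Figures 1-3 can be sketched as follows. Since the real CSV is not reproduced here, the snippet builds a tiny two-row stand-in whose column names mirror the features described above; the values are purely illustrative:

```python
import io
import pandas as pd

# Tiny synthetic stand-in for the real 657-substance CSV
# (column names follow the feature list above; values are illustrative).
csv = io.StringIO(
    "CAS,CMR,Readily_biodegradable,log_Koc,XLogP,BCF,EC50,Category\n"
    "50-00-0,yes,yes,1.0,0.35,3.2,,0\n"
    "1912-24-9,no,no,2.3,2.61,4.3,0.019,1\n"
)
PMT_data = pd.read_csv(csv)

print(PMT_data.head())          # sanity-check the import
print(PMT_data.describe())      # summary statistics of the numeric columns
print(PMT_data.dtypes)          # a mix of object and float64 columns
print(PMT_data.isnull().sum())  # EC50 has missing values to handle later
```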

To clean the dataset (see Figure 4), the column “CAS” was dropped first, since it does not contain relevant information for our model. Then, the null values in the “EC50” column were filled with the value 100, representing an EC50 value equal to or greater than 100 mg/L.

Figure 4: Dropping unrequired columns and filling in null values
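A minimal sketch of this cleaning step on a stand-in frame (the sentinel value 100 follows the convention described above):

```python
import pandas as pd

# Stand-in frame with the two columns touched by the cleaning step.
df = pd.DataFrame({
    "CAS": ["50-00-0", "1912-24-9"],
    "EC50": [None, 0.019],
    "log_Koc": [1.0, 2.3],
})

df = df.drop(columns=["CAS"])          # identifier only, not a model feature
df["EC50"] = df["EC50"].fillna(100.0)  # missing EC50 -> ">= 100 mg/L" sentinel
print(df)
```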

After re-checking that no null values were present anymore (not shown in Figure 4), the columns with non-numeric values (e.g. the “CMR” column) were converted into numeric values with the replace and map functions (see Figure 5; only the replacement with the map function is shown).

Figure 5: Non-numeric value conversion

As a last step prior to visualizing the data, all values in all columns had to be converted into float numbers so that the Logistic Regression classifier receives a uniform data format (see Figure 6). I decided to convert all columns to float instead of integer type so as not to lose the float numbers in the EC50, log Koc and XLogP columns. This cleaned and encoded dataset was stored in a dataframe named PMT_data_enc.

Figure 6: Convert all columns to float type
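The encoding and type conversion of Figures 5 and 6 can be sketched like this; the concrete mapping values (yes → 1, no → 0) are an assumption, the real notebook may use a different encoding:

```python
import pandas as pd

# Stand-in frame with one categorical and two numeric columns.
df = pd.DataFrame({
    "CMR": ["yes", "no"],
    "log_Koc": [1.0, 2.3],
    "EC50": [100.0, 0.019],
})

# Encode the categorical column numerically (mapping values are assumptions).
df["CMR"] = df["CMR"].map({"yes": 1, "no": 0})

# Cast everything to float for a uniform input format for the classifier.
PMT_data_enc = df.astype(float)
print(PMT_data_enc.dtypes)  # all float64
```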

To get a first overview of the data and the correlations between targets and features, Seaborn’s pairplot or heatmap (see Figure 7) can be used. A value close to zero between two features means that there is little to no correlation between them, e.g. log Koc does not depend on a substance being CMR, readily biodegradable, or having a STOT classification. In contrast, values closer to 1 in the heatmap indicate a correlation, e.g. log Koc and XLogP (aka log Pow) strongly correlate with each other. Furthermore, the feature “STOT_or_Cramer_Class_III” moderately correlates with the category. This makes sense, since a substance that is neither STOT- nor Cramer-classified does not fulfil the T (toxic) criterion required by both the PMT and the PBT definition.

Figure 7: Heatmap of the cleaned dataset
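A heatmap like the one in Figure 7 can be produced with a few lines of seaborn; here on random stand-in data, since the real notebook draws it on PMT_data_enc:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Random stand-in data; the real heatmap is drawn on PMT_data_enc.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)),
                  columns=["log_Koc", "XLogP", "BCF", "EC50"])

corr = df.corr()  # pairwise Pearson correlations between the features
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.savefig("heatmap.png")
```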

After preparing and visualizing the dataset, we start with part 2 of the project and prepare the data for modeling, meaning splitting and stratifying the data into a training and a test set (see Figure 8).

Figure 8: Split the data

The splitting ratio is 80 % training and 20 % test data; this is done by importing sklearn.model_selection and using its function train_test_split. Furthermore, by stratifying on y we ensure that the different classes are represented proportionally to their share of the total data, so that categories 1 and 2 do not end up in the test group only. Then, we have to standardize the data using the StandardScaler() class from sklearn. This puts the numbers of the dataset on a consistent scale while keeping the proportional relationships between them. The last step in the section “Modeling the data” is a baseline prediction (see Figure 9):

Figure 9: Baseline prediction prior to our model training

The baseline is the probability of predicting a class correctly before the model is implemented. The goal of our model is to improve on this baseline, or random, prediction. Also, if there were a strong class imbalance (say, 90 % of the data in class 0), we could alter the proportion of each class to help the model predict more accurately. For that, we simply use the value_counts method from pandas to show what percentage falls into categories 0, 1 and 2. With ca. 83.9 % category 0 (neither PMT nor PBT), ca. 13.2 % category 1 (PMT) and ca. 2.9 % category 2 (PBT) substances, the class proportions are manageable, though clearly imbalanced.
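The splitting, standardization, and baseline check described above can be sketched on stand-in data; the feature names and class counts below are illustrative stand-ins for PMT_data_enc:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in features and targets (the real data has 657 rows and 7 features).
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(100, 3)),
                 columns=["log_Koc", "XLogP", "BCF"])
y = pd.Series([0] * 84 + [1] * 13 + [2] * 3, name="Category")

# 80/20 split, stratified on y so classes 1 and 2 appear in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize: fit on the training data only, then transform both sets.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Baseline: the class proportions of the full target vector.
print(y.value_counts(normalize=True))
```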

Now, we come to the core part of this ML project: creating the model, fitting it, making predictions, and cross-validating on the training data (see Figure 10).

Figure 10: Fit and predict with Logistic Regression model

Here, sklearn comes with a ready-to-use class, LogisticRegression(), that provides all the relevant methods like fit and predict. The predicted values y_pred are the categories 0, 1 and 2 that the model predicts for the test features X_test. Next, we determine the score of the model (see Figure 11).

Figure 11: Accuracy and cross-validation

The score is a measure of the accuracy: the closer it is to 1, the better the accuracy of our model. The score was 0.95, thus quite good, and could not be enhanced further by cross-validation using the function cross_val_score from sklearn.model_selection.
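The model creation, fitting, scoring, and cross-validation steps can be sketched end to end; as a stand-in for the substance data, this sketch uses a synthetic three-class dataset from sklearn’s make_classification:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic 3-class, 7-feature stand-in for the substance dataset.
X, y = make_classification(n_samples=200, n_features=7,
                           n_informative=5, n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000)  # max_iter raised for convergence
model.fit(X_train, y_train)
y_pred = model.predict(X_test)             # predicted categories 0, 1, 2

print("test accuracy:", model.score(X_test, y_test))
print("5-fold CV accuracy:",
      cross_val_score(model, X_train, y_train, cv=5).mean())
```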

We can also directly compare the categories our model predicts with the actual categories (see Figure 12) by applying the predict method and displaying the actual and predicted values (i.e. categories 0, 1 and 2) in a dataframe.

Figure 12: Compare predicted and actual values

In order to evaluate the true and false predictions made by the model, we can calculate and display the confusion matrix from sklearn.metrics. There is a ready-to-use function, confusion_matrix, that we can apply to the test and predicted data and plot graphically as a heatmap. Details on the settings and formatting of the confusion matrix are shown in the code in Figure 13.

Figure 13: Confusion matrix code
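A minimal version of such a confusion-matrix plot looks like this; the toy actual/predicted vectors below stand in for y_test and y_pred:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Toy actual vs. predicted categories (0 / 1 / 2), stand-ins for y_test / y_pred.
y_true = [0, 0, 0, 1, 1, 2, 2, 0, 1, 2]
y_hat  = [0, 0, 1, 1, 1, 2, 0, 0, 1, 2]

cm = confusion_matrix(y_true, y_hat)  # rows: actual, columns: predicted
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=[0, 1, 2], yticklabels=[0, 1, 2])
plt.xlabel("Predicted category")
plt.ylabel("Actual category")
plt.savefig("confusion_matrix.png")
```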

The plot of the confusion matrix (see Figure 14) shows the predicted values in the columns and the actual values in the rows. In this way we can see where the model makes true and false predictions, and if it predicts incorrectly, we can see which category it predicts falsely.

Figure 14: Confusion matrix plot

Another good way to check how your model is performing is to look at the classification report. This report can be generated with the function classification_report from sklearn.metrics (see Figure 15).

Figure 15: Classification Report
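Generating such a report takes one call; here again on toy actual/predicted vectors standing in for y_test and y_pred:

```python
from sklearn.metrics import classification_report

# Toy actual vs. predicted categories, stand-ins for y_test / y_pred.
y_true = [0, 0, 0, 1, 1, 2, 2, 0, 1, 2]
y_hat  = [0, 0, 1, 1, 1, 2, 0, 0, 1, 2]

# Per-class precision, recall, and f1-score plus overall accuracy in one table.
print(classification_report(y_true, y_hat, digits=2))
```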

It shows the precision, recall, f1, and accuracy scores. The f1-score[17] is a metric that also measures the accuracy of the model by combining its precision and recall values. The accuracy metric calculates how often the model made a correct prediction across the entire dataset. So, the closer the f1-score is to 1, the better the model. The metrics accuracy, precision, and recall are calculated per category according to the following logic:

Accuracy = number of correct predictions / total number of examples

Precision (per category) = number predicted as a given category that actually belong to that category / total number predicted as that category

Recall (per category) = number predicted as a given category that actually belong to that category / total number actually in that category
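A tiny worked example of these formulas, checked against sklearn’s own metric functions (the five toy labels are illustrative):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Five toy examples with categories 0, 1, 2.
y_true = [0, 0, 1, 1, 2]
y_hat  = [0, 1, 1, 1, 2]

acc = accuracy_score(y_true, y_hat)                  # 4 of 5 correct -> 0.8
prec = precision_score(y_true, y_hat, average=None)  # per category 0, 1, 2
rec = recall_score(y_true, y_hat, average=None)      # per category 0, 1, 2

# For category 1: predicted three times, two of which are truly 1 -> precision 2/3;
# both actual 1s were found -> recall 1.0.
print(acc, prec, rec)
```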

Interpreting our results depicted in Figure 15: the overall accuracy of the model is quite high (0.92), and the precision, f1 and recall values for all three categories are also quite good, with only the precision for category 1 and the recall for category 2 falling below 0.8.

As a conclusion, we have a suitable Logistic Regression model enabling classification of a substance as PMT, PBT, or neither of these two categories based on the substances’ feature data. What we cannot distinguish with the model so far is “very persistent and very bioaccumulative” (vPvB) from PBT, and “very persistent and very mobile” (vPvM) substances from PMT. This is due to the lack of long-term biodegradation data in soil and (marine) water that is required for classification as very persistent (vP) and very mobile (vM)[3]. We also do not have the test data required to demonstrate that a substance meets the requirements to be very bioaccumulative (vB) and is not biodegradable under long-term degradation conditions[5].

I would like to highlight that the main challenge of this project was getting all the data, especially a complete set of features per substance for training and testing the model. Furthermore, fine-tuning of the model has not been done yet. One might also consider using additional or other features (e.g. the NOEC values, as far as available) and/or a larger dataset with more substances of categories 1 and 2 to enhance the model’s precision and recall. So there is a lot of room for improvement and further projects based on this one. But it might serve as a starter for your own ML project 😊.

Literature & Weblinks

Cover picture source: https://pixabay.com/de/photos/duck-nature-dive-water-vogel-4579153/

[1]         REACH Regulation: https://environment.ec.europa.eu/topics/chemicals/reach-regulation_en

[2]         CLP-Regulation: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A02008R1272-20231201

[3]         Hale et al., “Persistent, mobile and toxic (PMT) and very persistent and very mobile (vPvM) substances pose an equivalent level of concern to persistent, bioaccumulative and toxic (PBT) and very persistent and very bioaccumulative (vPvB) substances under REACH”, Environmental Sciences Europe, Article no. 155 (2020), Weblink: https://enveurope.springeropen.com/articles/10.1186/s12302-020-00440-4

[4]         PMT definition and criteria: https://www.umweltbundesamt.de/en/the-final-pmtvpvm-criteria-after-public

[5]         PBT definition and criteria: https://www.umweltbundesamt.de/themen/chemikalien/chemikalien-reach/stoffgruppen/eu-veroeffentlicht-neue-pbt-vpvb-kriterien-unter

[6]         Jupyter notebook weblink: https://jupyter.org/

[7]         ECHA Chem database link: https://chem.echa.europa.eu/

[8]         PubChem database link: https://pubchem.ncbi.nlm.nih.gov/

[9]         EPI Suite: https://www.epa.gov/tsca-screening-tools/epi-suitetm-estimation-program-interface

[10]       ECHA PBT assessment list weblink: https://www.echa.europa.eu/de/web/guest/pbt

[11]       kaggle publication link: https://www.kaggle.com/code/angelikakeller/pmt-pbt-substance-prediction

[12]       matplotlib weblink: https://matplotlib.org/

[13]       pandas user guide weblink: https://pandas.pydata.org/docs/user_guide/index.html

[14]       seaborn weblink: https://seaborn.pydata.org/

[15]       numpy weblink: https://numpy.org/

[16]       scikit-learn weblink: https://scikit-learn.org/stable/

[17]       F1-score explanation, weblink: https://encord.com/blog/f1-score-in-machine-learning/
