Since I am quite new to Machine Learning (ML), I was inspired by the application of ML to a huge variety of different data. One example that caught my eye was the heart failure prediction dataset[1] and the Python code[2] for a stroke dataset, both found on www.kaggle.com. In addition, I have used bits of the very good example code in the ML introduction book ‘Machine Learning with Python for Everyone’ by Mark E. Fenner[3] (which I strongly recommend as a starter for ML novices 😊). Based on the mentioned code and with the free heart failure dataset at hand, I decided to try a simple ML project by myself. Let’s have a look at what I did there.

While learning and exploring the world of Python programming and data analysis, I was curious what ‘Machine Learning’, a topic that lots of folks mention these days, is all about. So I decided to take a closer look at the principles of Machine Learning algorithms and their application in Python, using a Jupyter Notebook[4]. I originally wrote the following code in a Jupyter Notebook which I have already published on Kaggle[5]; the only difference is that the version shown below omits the graphic visualization that is included in the Kaggle version.

I started by importing the relevant libraries, namely sklearn[6], numpy[7] and pandas[8], and reading the heart failure dataset from a CSV file into a pandas DataFrame. With the DataFrame df, one can easily evaluate the data’s characteristics, such as the number of columns and rows, data types, mean values etc.

Knowing these characteristics, I could standardize the data to prepare it for training and testing five selected ML models. Before that, I separated the column ‘HeartDisease’ from the rest of the columns, since this column is the ‘target’, assigned to the variable y: heart disease respectively heart failure, yes (1) or no (0), which we would like to predict at a later stage. The other columns are the so-called ‘features’, assigned to the variable X, such as maximum heart rate, cholesterol, chest pain type etc.

#! python3 - MLHeartDisease.py
import sklearn
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# Import and view the heart failure dataset
df = pd.read_csv("../input/heart-failure-prediction/heart.csv")

print(df.head(5)) # HeartDisease: 0 = no, 1 = yes
print(df.shape)
print(df.info())
print(df.describe().T)
   
# Standardize the dataset
X = df.drop(["HeartDisease"], axis=1)
y = df["HeartDisease"]
y = pd.DataFrame(y, columns=["HeartDisease"])
print(X.head()) # these are our features
print(y.head()) # this is our target
numerical_cols = X.select_dtypes(["float64","int64"])
scaler = StandardScaler()
X[numerical_cols.columns] = scaler.fit_transform(X[numerical_cols.columns])

# One hot encoding should be done on this dataset to convert all non-numeric into numeric features.
categorical_cols = X.select_dtypes("object")
X = pd.get_dummies(X, columns=categorical_cols.columns)
print(X.head())

# Split the heart failure dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Prepare DataFrame for the Accuracy results
models = pd.DataFrame(columns=["Model","Accuracy Score"])

# Define the models we want to train and test
classifiers = {'logReg' : LogisticRegression(),
               'GradBC' : GradientBoostingClassifier(),
               'randomforest' : RandomForestClassifier(n_estimators=1000, random_state=42),
               'DT' : DecisionTreeClassifier(max_depth=3),
               'AdaBC' : AdaBoostClassifier()}

# Fit and predict each of the five models 
for model_name, model in classifiers.items():
    model.fit(X_train, y_train.values.ravel())
    predictions = model.predict(X_test)
    score = accuracy_score(y_test, predictions)
    new_row = {"Model": model_name, "Accuracy Score": score}
    # DataFrame.append was removed in newer pandas versions, so use pd.concat instead
    models = pd.concat([models, pd.DataFrame([new_row])], ignore_index=True)

# Sort the results
print(models.sort_values(by="Accuracy Score", ascending=False))

# Take a deeper look into one of the models with highest accuracy,
# meaning for the sake of simplicity only for Logistic Regression and its metrics.

# First, take a look at the non-optimized logReg model:
logReg = LogisticRegression()

scores = []
for r in range(20):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    fit = logReg.fit(X_train, y_train.values.ravel())
    pred = fit.predict(X_test)
    score = mean_squared_error(y_test, pred)
    scores.append(score)
    
scores = pd.Series(np.sqrt(sorted(scores)))
df = pd.DataFrame({'RMSE': scores})
df.index.name = 'Repeat'
print(df.T)
# The RMSE (error) value of our non-optimized logReg model is quite low, that's good!

# The Confusion Matrix cm of correct and incorrect predictions
cm = confusion_matrix(y_test, pred)
print(cm)
# So far, the model predicts true-positive and true-negative cases very well.
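# Note on reading cm: in scikit-learn's confusion_matrix the rows are the true
# classes and the columns the predicted classes, so cm[0, 0] are the true
# negatives, cm[1, 1] the true positives, cm[0, 1] the false positives and
# cm[1, 0] the false negatives.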

# This is to find out what hyperparameters are available for LogisticRegression
print(logReg.get_params().keys())

# Try to improve the hyperparameters of the logReg model with GridSearchCV
param_grid = [    
    {'penalty' : ['l1', 'l2'],
    'solver' : ['liblinear','saga']}
]

# Now create the GridSearch object
grid_model = GridSearchCV(logReg,
                          return_train_score=True,
                          param_grid=param_grid,
                          cv=20,
                          verbose=True,
                          n_jobs=-1)

# Fit the grid_model on the heart failure data
best_model = grid_model.fit(X_train, y_train.values.ravel())

param_cols = ['param_penalty']
score_cols = ['mean_train_score', 'std_train_score',
              'mean_test_score', 'std_test_score']

# Look at the cross-validation results for the 'penalty' parameter
df = pd.DataFrame(grid_model.cv_results_).head(10)

print(df[param_cols + score_cols])

param_cols = ['param_solver']
score_cols = ['mean_train_score', 'std_train_score',
              'mean_test_score', 'std_test_score']

# Look at the cross-validation results for the 'solver' parameter
df = pd.DataFrame(grid_model.cv_results_).head(10)

print(df[param_cols + score_cols])

Since the features X are used for the training and prediction of heart disease, all numerical feature values had to be standardized and the non-numeric values converted into numeric ones via one hot encoding[9].
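
To illustrate what pd.get_dummies does, here is a small toy sketch; the column name and values are made up for demonstration and are not taken from the actual dataset:

# Toy sketch of one hot encoding (made-up data, not from the heart failure dataset)
toy = pd.DataFrame({"Colour": ["red", "blue", "red"]})
print(pd.get_dummies(toy, columns=["Colour"]))
# Each category becomes its own indicator column, Colour_blue and Colour_red,
# filled with 0/1 (or True/False, depending on the pandas version).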

This standardized data was then split into training feature data (X_train), training target data (y_train), test feature data (X_test) and test target data (y_test), with test_size set to 0.3 (meaning 70 % training data, 30 % test data) and random_state=42 (making the random split reproducible).
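
As a quick sanity check, one can verify the 70/30 proportions by looking at the shapes of the split data; this is a small sketch reusing the variables from the code above:

# Sketch: check the proportions of the split from the code above
print(X_train.shape, X_test.shape)     # row counts of the training and test parts
print(round(len(X_test) / len(X), 2))  # should be close to the chosen test_size of 0.3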

The idea is then to ‘train’ the ML models with the training feature data and the training target data, so that the prediction of the test target data based on the ‘unknown’ test feature data is achieved with the highest possible accuracy.

As we see in the code, in preparation for the ML training and prediction (fit and predict), I first created a new DataFrame named models to collect each model name and its accuracy. Secondly, the five models are instantiated in the dictionary named classifiers. In the next step, we loop over classifiers and perform fit and predict for each of the five models, determining the accuracy value for each model as score. Sorting and printing the models DataFrame gives a ranking of the accuracy scores from highest to lowest value.

As a result, all five models performed well, with accuracies between 0.83 and 0.88, and Logistic Regression was among the best models with the highest accuracy (0.87 – 0.88). Thus, I selected Logistic Regression for further evaluation and optimization steps:

For this purpose, a new instance of a Logistic Regression model named logReg was created and its error was calculated over 20 repeated ‘fit and predict’ runs, collecting the mean squared error of each run in the list named scores. Here, the np.sqrt function does the trick of converting these mean squared error values into RMSE (root-mean-squared-error) values. As a result, the RMSE was always in the range of 0.2 – 0.5, meaning very low and thus quite good.
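
To make that step explicit: the mean squared error of each run is turned into the RMSE simply by taking its square root, as this minimal sketch with a made-up value shows:

# Minimal sketch with a made-up number: RMSE is the square root of the MSE
mse_example = 0.16           # hypothetical mean squared error of one run
print(np.sqrt(mse_example))  # -> 0.4, back on the scale of the 0/1 target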

Furthermore, I checked whether the Logistic Regression model’s accuracy could be enhanced further by optimizing its hyperparameters. For that, I had a look at the parameters available for this model by calling get_params().keys() on the logReg instance. I decided to select the two hyperparameters 'penalty' and 'solver' and to evaluate with GridSearchCV whether they can be optimized. For this step, the data had to be split again into a training and a testing part, and grid_model was set up to evaluate the best hyperparameter values based on the split data. As a result, I saw that for this dataset the model cannot be improved further by tuning the two hyperparameters 'penalty' and 'solver'. However, this might be different for another, similar dataset.
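
For completeness, the fitted GridSearchCV object also reports the winning combination directly; a short sketch based on grid_model and best_model from the code above could look like this:

# Sketch: read off the best hyperparameter combination found by the grid search
print(best_model.best_params_)  # dict with the chosen 'penalty' and 'solver'
print(best_model.best_score_)   # mean cross-validated score of that combination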

Although the ML example shown here is quite a ‘simple’ case, to me it demonstrates the complexity of ML, and yet it only scratches the surface of what is possible – and probably impossible – with ML in Python 😊. But I can clearly say that I have enjoyed this project a lot and will certainly come back to ML again with other data and questions to solve.

Literature & Sources:

[1] Heart failure dataset available at Kaggle: https://www.kaggle.com/fedesoriano/heart-failure-prediction

[2] Code on the stroke dataset by Emre Arslan: https://www.kaggle.com/emrearslan123/machine-learning-on-stroke-prediction-dataset

[3] ‘Machine Learning with Python for Everyone’ by Mark E. Fenner, Addison-Wesley, ISBN-13: 978-0-13-484562-3, ISBN-10: 0-13-484562-5

[4] Jupyter Notebook official website: https://jupyter.org/

[5] My code at Kaggle (with graphic visualization): https://www.kaggle.com/angelikakeller/ml-with-python-on-heart-failure-dataset

[6] Scikit Learn (aka sklearn) official website: https://scikit-learn.org/stable/

[7] NumPy official website: https://numpy.org/

[8] Pandas official website: https://pandas.pydata.org/

[9] More information on one hot encoding can be found here: https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/

Title Picture by Raman Oza at www.pixabay.com

