{"id":379,"date":"2021-11-13T19:52:09","date_gmt":"2021-11-13T19:52:09","guid":{"rendered":"https:\/\/kellerbits.net\/wordpress\/?p=379"},"modified":"2021-11-13T19:52:10","modified_gmt":"2021-11-13T19:52:10","slug":"machine-learning-with-python-on-heart-failure-dataset","status":"publish","type":"post","link":"https:\/\/kellerbits.net\/wordpress\/?p=379","title":{"rendered":"Machine Learning with Python on Heart Failure Dataset"},"content":{"rendered":"\n<p><em>Since I am quite new to Machine Learning (ML), I was inspired by the application of ML to a huge variety of different data. One example that caught my eye was the heart failure prediction dataset<sup>[1]<\/sup> and the Python code<sup>[2]<\/sup> for the stroke data, both dataset and code found on www.kaggle.com. In addition, I have used bits of the very good example code in the ML introduction book &#8216;Machine Learning with Python for Everyone&#8217; by Mark E. Fenner<sup>[3]<\/sup> (which I strongly recommend as a starter for ML novices \ud83d\ude0a). Based on the mentioned code and with the free heart failure dataset at hand, I decided to try a simple ML project by myself. Let\u2019s have a look at what I did there.<\/em><\/p>\n\n\n\n<p>During my time learning and exploring the world of Python programming and data analysis, I was curious what &#8216;Machine Learning&#8217;, a topic that lots of folks mention these days, is about. So I decided to have a closer look at the principles of Machine Learning algorithms and their application in Python, in this case in a Jupyter Notebook<sup>[4]<\/sup>. 
Thus, I originally wrote the following code in a Jupyter Notebook, which I&#8217;ve already published on Kaggle<sup>[5]<\/sup>; the only difference is that the code shown below omits the graphic visualization included in the version published on Kaggle.<\/p>\n\n\n\n<p>I started by importing the relevant libraries, that is <code>sklearn<\/code><sup>[6]<\/sup>, <code>numpy<\/code><sup>[7]<\/sup> and <code>pandas<\/code><sup>[8]<\/sup>, and reading the heart failure dataset from a CSV file into a <code>pandas<\/code> dataframe. With the dataframe <code>df<\/code>, one can easily evaluate the data\u2019s characteristics, such as the number of columns and rows, data types, mean values etc.<\/p>\n\n\n\n<p>Knowing these characteristics, I could standardize the data to prepare it for use in the ML training and testing of five selected models. Before that, I separated the column &#8216;HeartDisease&#8217; from the rest of the columns, since this column is the &#8216;target&#8217;, assigned to the variable <code>y<\/code>: heart disease respectively heart failure, yes (1) or no (0), which we would like to predict at a later stage. The other columns are the so-called &#8216;features&#8217;, assigned to the variable <code>X<\/code>, such as heart rate, cholesterol, chest pain type etc.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#! 
python3 - MLHeartDisease.py\nimport sklearn\nimport numpy as np\nimport pandas as pd\n\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.metrics import accuracy_score\nfrom sklearn.preprocessing import StandardScaler\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier\nfrom sklearn.tree import DecisionTreeClassifier\n\n# Import and view the heart failure dataset\ndf = pd.read_csv(\"..\/input\/heart-failure-prediction\/heart.csv\")\n\nprint(df.head(5)) # HeartDisease: 0 = no, 1 = yes\nprint(df.shape)\nprint(df.info())\nprint(df.describe().T)\n\n# Separate the target from the features and standardize the numerical features\nX = df.drop(&#091;\"HeartDisease\"], axis=1)\ny = df&#091;\"HeartDisease\"]\ny = pd.DataFrame(y, columns=&#091;\"HeartDisease\"])\nprint(X.head()) # these are our features\nprint(y.head()) # this is our target\nnumerical_cols = X.select_dtypes(&#091;\"float64\",\"int64\"])\nscaler = StandardScaler()\nX&#091;numerical_cols.columns] = scaler.fit_transform(X&#091;numerical_cols.columns])\n\n# One hot encoding converts all non-numeric features into numeric ones.\ncategorical_cols = X.select_dtypes(\"object\")\nX = pd.get_dummies(X, columns=categorical_cols.columns)\nprint(X.head())\n\n# Split the heart failure dataset\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n\n# Prepare DataFrame for the accuracy results\nmodels = pd.DataFrame(columns=&#091;\"Model\",\"Accuracy Score\"])\n\n# Define the models we want to train and test\nclassifiers = {'logReg' : LogisticRegression(),\n               'GradBC' : GradientBoostingClassifier(),\n               'randomforest' : RandomForestClassifier(n_estimators=1000, random_state=42),\n               'DT' : DecisionTreeClassifier(max_depth=3),\n               'AdaBC' : AdaBoostClassifier()}\n\n# Fit and predict each of the five models\nfor model_name, model in classifiers.items():\n    model.fit(X_train, 
y_train.values.ravel())\n    predictions = model.predict(X_test)\n    score = accuracy_score(y_test, predictions)\n    new_row = {\"Model\": model_name, \"Accuracy Score\": score}\n    # DataFrame.append was removed in pandas 2.0, so collect rows via pd.concat\n    models = pd.concat(&#091;models, pd.DataFrame(&#091;new_row])], ignore_index=True)\n\n# Sort the results\nprint(models.sort_values(by=\"Accuracy Score\", ascending=False))\n\n# Take a deeper look into one of the models with highest accuracy,\n# meaning for the sake of simplicity only for Logistic Regression and its metrics.\n\n# First, take a look at the non-optimized logReg model:\nlogReg = LogisticRegression()\n\nscores = &#091;]\nfor r in range(20):\n    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)\n    fit = logReg.fit(X_train, y_train.values.ravel())\n    pred = fit.predict(X_test)\n    score = sklearn.metrics.mean_squared_error(y_test, pred)\n    scores.append(score)\n\nscores = pd.Series(np.sqrt(sorted(scores)))\ndf = pd.DataFrame({'RMSE': scores})\ndf.index.name = 'Repeat'\nprint(df.T)\n# The RMSE (error) value of our non-optimized logReg model is quite low, that's good!\n\n# The confusion matrix cm of correct and incorrect predictions\ncm = sklearn.metrics.confusion_matrix(y_test, pred)\nprint(cm)\n# So far, the model predicts true-positive and true-negative cases very well.\n\n# Find out which hyperparameters are available for LogisticRegression\nprint(logReg.get_params().keys())\n\n# Try to improve the hyperparameters of the logReg model with GridSearchCV\nparam_grid = &#091;\n    {'penalty' : &#091;'l1', 'l2'],\n     'solver' : &#091;'liblinear','saga']}\n]\n\n# Now create the GridSearch object\ngrid_model = sklearn.model_selection.GridSearchCV(logReg,\n                               return_train_score = True,\n                               param_grid = param_grid,\n                               cv = 20,\n                               verbose = True,\n                               n_jobs = -1)\n\n# Fit the 
grid_model on the heart failure data\nbest_model = grid_model.fit(X_train, y_train.values.ravel())\n\nparam_cols = &#091;'param_penalty']\nscore_cols = &#091;'mean_train_score', 'std_train_score',\n              'mean_test_score', 'std_test_score']\n\n# Look at the first 10 parameter results with head\ndf = pd.DataFrame(grid_model.cv_results_).head(10)\n\nprint(df&#091;param_cols + score_cols])\n\nparam_cols = &#091;'param_solver']\nscore_cols = &#091;'mean_train_score', 'std_train_score',\n              'mean_test_score', 'std_test_score']\n\n# Look at the first 10 parameter results with head\ndf = pd.DataFrame(grid_model.cv_results_).head(10)\n\nprint(df&#091;param_cols + score_cols])\n<\/code><\/pre>\n\n\n\n<p>Since the features <code>X<\/code> are used for the training and the prediction of heart disease, all numerical values in <code>X<\/code> had to be standardized and the non-numeric values converted into numeric ones via one hot encoding<sup>[9]<\/sup>.<\/p>\n\n\n\n<p>This standardized data was then split into training feature data (<code>X_train<\/code>), training target data (<code>y_train<\/code>), test feature data (<code>X_test<\/code>) and test target data (<code>y_test<\/code>), with <code>test_size<\/code> set to 0.3 (meaning 70 % training data, 30 % test data) and <code>random_state=42<\/code> (making the split reproducible).<\/p>\n\n\n\n<p>The idea is then to &#8216;train&#8217; the ML models with the training feature data and the training target data, so that the test target data can be predicted from the &#8216;unknown&#8217; test feature data with the highest possible accuracy.<\/p>\n\n\n\n<p>As we see in the code, in preparation for the ML training and prediction (<code>fit<\/code> and <code>predict<\/code>), I created a new dataframe named <code>models<\/code> to collect each model name together with its accuracy score. Secondly, the five models are instantiated in the dictionary named <code>classifiers<\/code>. 
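To make the fit-and-predict comparison loop runnable on its own, here is a minimal sketch of the same pattern. It uses a synthetic dataset from `make_classification` in place of the heart failure CSV and only two of the five classifiers; since `DataFrame.append` was removed in pandas 2.0, the per-model results are collected in a plain list and turned into a dataframe once at the end.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart failure data (binary target, numeric features)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

classifiers = {
    "logReg": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(max_depth=3),
}

# Fit, predict and score each model; collect the rows in a list
rows = []
for model_name, model in classifiers.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rows.append({"Model": model_name,
                 "Accuracy Score": accuracy_score(y_test, predictions)})

# One dataframe at the end, ranked from highest to lowest accuracy
models = pd.DataFrame(rows).sort_values(by="Accuracy Score", ascending=False)
print(models)
```

The design point is the same as in the full code above: keeping the models in a dictionary means adding or removing a candidate model is a one-line change, and the loop body never changes.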
In the next step, we loop over <code>classifiers<\/code> and perform <code>fit<\/code> and <code>predict<\/code> for each of the five models therein, determining the accuracy value for each model as <code>score<\/code>. Sorting and printing the <code>models<\/code> dataframe then ranks the models by accuracy score from highest to lowest.<\/p>\n\n\n\n<p>As a result, all five models performed well, with an accuracy between 0.83 and 0.88 and <code>Logistic Regression<\/code> as one of the best models with the highest accuracy (0.87 \u2013 0.88). Thus, I selected <code>Logistic Regression<\/code> for further evaluation and optimization steps:<\/p>\n\n\n\n<p>For this purpose, a new instance of a <code>Logistic Regression<\/code> model named <code>logReg<\/code> was created and evaluated in 20 repeated &#8216;fit and predict&#8217; runs, collecting the mean squared error of each run in the list named <code>scores<\/code>. Afterwards, the <code>np.sqrt<\/code> function converts these mean squared errors into RMSE (root-mean-squared-error) values. As a result, the RMSE was always in the range of 0.2 \u2013 0.5, meaning quite low and thus quite good.<\/p>\n\n\n\n<p>Furthermore, I checked whether the <code>Logistic Regression<\/code> model\u2019s accuracy could be enhanced even further by optimizing its hyperparameters. Therefore, I had a look at the available parameters for this model by using the <code>get_params()<\/code> method. I decided to select the two hyperparameters <code>'penalty'<\/code> and <code>'solver'<\/code> and to evaluate with <code>GridSearchCV<\/code> whether they can be optimized. For this step, the data had to be split again into a training and a testing part, and the <code>grid_model<\/code> was set up to determine, based on the split data, the best value for each of these hyperparameters. As a result, I saw that, for this dataset, the model could not be improved further by tuning the two hyperparameters <code>'penalty'<\/code> and <code>'solver'<\/code>. 
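As a self-contained illustration of this GridSearchCV step, the following sketch runs the same 'penalty'/'solver' grid; `make_classification` again stands in for the heart failure dataset, and the `cv` and `max_iter` values are illustrative choices, not the ones from the post.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for the heart failure dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Same two hyperparameters as in the post: regularization penalty and solver
param_grid = [{"penalty": ["l1", "l2"],
               "solver": ["liblinear", "saga"]}]

# Exhaustively cross-validate all 2 x 2 = 4 parameter combinations
grid_model = GridSearchCV(LogisticRegression(max_iter=5000),
                          param_grid=param_grid,
                          cv=5,
                          return_train_score=True)
grid_model.fit(X, y)

print(grid_model.best_params_)          # winning combination
print(round(grid_model.best_score_, 3)) # its mean cross-validated accuracy
```

If, as in the post, the mean test scores of all combinations are essentially identical, the model simply cannot be improved along these two axes, and other hyperparameters (or other models) are the next thing to try.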
However, this might be different for another similar dataset.<\/p>\n\n\n\n<p>Although the shown ML example is quite a &#8216;simple&#8217; case, to me it demonstrates the complexity of ML, and yet it is only scratching the surface of what is possible \u2013 and probably impossible &#8211; with ML in Python \ud83d\ude0a. But I can clearly say that I have enjoyed this project a lot and will certainly come back to ML with other data and questions to solve.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Literature &amp; Sources:<\/strong><\/h2>\n\n\n\n<p>[1] Heart failure dataset available at Kaggle: <a href=\"https:\/\/www.kaggle.com\/fedesoriano\/heart-failure-prediction\">https:\/\/www.kaggle.com\/fedesoriano\/heart-failure-prediction<\/a><\/p>\n\n\n\n<p>[2] Code on the stroke dataset by Emre Arslan: <a href=\"https:\/\/www.kaggle.com\/emrearslan123\/machine-learning-on-stroke-prediction-dataset\">https:\/\/www.kaggle.com\/emrearslan123\/machine-learning-on-stroke-prediction-dataset<\/a><\/p>\n\n\n\n<p>[3] &#8216;Machine Learning with Python for Everyone&#8217; by Mark E. 
Fenner, Addison-Wesley, ISBN-13: 978-0-13-484562-3, ISBN-10: 0-13-484562-5<\/p>\n\n\n\n<p>[4] Jupyter Notebook official website: <a href=\"https:\/\/jupyter.org\/\">https:\/\/jupyter.org\/<\/a><\/p>\n\n\n\n<p>[5] My code at Kaggle (with graphic visualization) <a href=\"https:\/\/www.kaggle.com\/angelikakeller\/ml-with-python-on-heart-failure-dataset\">https:\/\/www.kaggle.com\/angelikakeller\/ml-with-python-on-heart-failure-dataset<\/a><\/p>\n\n\n\n<p>[6] Scikit Learn (aka sklearn) official website: <a href=\"https:\/\/scikit-learn.org\/stable\/\">https:\/\/scikit-learn.org\/stable\/<\/a><\/p>\n\n\n\n<p>[7] NumPy official website: <a href=\"https:\/\/numpy.org\/\">https:\/\/numpy.org\/<\/a><\/p>\n\n\n\n<p>[8] Pandas official website: <a href=\"https:\/\/pandas.pydata.org\/\">https:\/\/pandas.pydata.org\/<\/a><\/p>\n\n\n\n<p>[9] More information on one hot encoding can be found here: <a href=\"https:\/\/machinelearningmastery.com\/how-to-one-hot-encode-sequence-data-in-python\/\">https:\/\/machinelearningmastery.com\/how-to-one-hot-encode-sequence-data-in-python\/<\/a><\/p>\n\n\n\n<p>Title Picture by Raman Oza at www.pixabay.com <\/p>\n","protected":false},"excerpt":{"rendered":"<p>Since I am quite new to Machine Learning (ML), I was inspired by the application of ML on a huge variety of different data. 
One example that caught my eye was the heart failure prediction dataset[1] and the Python code[2] for the stroke data, both dataset and code found on [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":380,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[10],"tags":[22,21],"class_list":["post-379","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-python","tag-jupyter-notebook","tag-machine-learning"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/kellerbits.net\/wordpress\/wp-content\/uploads\/2021\/11\/heart-g0169b389a_1920.jpg?fit=1920%2C1080&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/kellerbits.net\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/379","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/kellerbits.net\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/kellerbits.net\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/kellerbits.net\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/kellerbits.net\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=379"}],"version-history":[{"count":8,"href":"https:\/\/kellerbits.net\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/379\/revisions"}],"predecessor-version":[{"id":391,"href":"https:\/\/kellerbits.net\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/379\/revisions\/391"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/kellerbits.net\/wordpress\/index.php?rest_route=\/wp\/v2\/media\/380"}],"wp:attachment":[{"href":"https:\/\/kellerbits.net\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=379"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/kellerbits.net\/wordpress\/index.php?res
t_route=%2Fwp%2Fv2%2Fcategories&post=379"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/kellerbits.net\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=379"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}