Machine Learning: Introduction to Scikit-Learn. Learn how to create your first machine learning model!
Create and train a Random Forest classifier on a real heart disease dataset with Python and the Scikit-learn library.
Introduction
Are you interested in machine learning but don't know where to start? This blog post is for you. In this tutorial, I will guide you step by step through building your first machine learning model with the scikit-learn library. Scikit-learn is so widely used that many books have been written about it, but by the end of this tutorial you will know what it is and how to use it to create and train a model on a real-world dataset, so stay tuned!
Requirements
Basic knowledge of Python.
Python and scikit-learn installed.
What is Scikit-learn?
Scikit-learn is a machine learning library for Python. If we have data, scikit-learn provides us with tools to build models that learn patterns from that data and perform tasks such as classification, regression, and clustering.
Classification assigns objects to predefined classes. Clustering identifies similarities between objects and groups them according to these common characteristics. Regression attempts to determine the strength and character of the relationship between a dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).
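To make this concrete, here is a minimal sketch of the three task families using scikit-learn's built-in toy iris dataset (an illustration only, not part of this tutorial's workflow):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification : assign each sample to one of the predefined classes
DecisionTreeClassifier().fit(X, y)

# Regression : model a numeric target (we reuse y here purely to show the API)
LinearRegression().fit(X, y)

# Clustering : group samples by similarity, without using the labels at all
KMeans(n_clusters=3, n_init=10).fit(X)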
Scikit-learn also implements tools to help us evaluate these predictions and thus assess the performance of our model.
Why Scikit-learn?
Built on NumPy and Matplotlib.
Has many built-in machine learning models.
Methods to evaluate machine learning models.
Very well-designed API.
Steps of our workflow:
Collect the data.
Pick a model.
Fit the model to the data and make predictions.
Evaluate the model.
Improve through experimentation.
Save and reload your trained model.
Where to get help?
If you don't know what a function does, you can find help in:
The scikit-learn documentation: Scikit-learn documentation
Stack Overflow: stackoverflow
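You can also get help without leaving Python. For example, a quick way to read a docstring (shown here on the classifier we use later in this tutorial):
from sklearn.ensemble import RandomForestClassifier

# Print the full docstring of a class (or of any function / method)
help(RandomForestClassifier)
# In a Jupyter notebook, a trailing question mark does the same :
# RandomForestClassifier?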
Collect the data
Let's use the heart disease dataset from Kaggle: Download the heart disease dataset.
First, let's load our dataset:
import pandas as pd
heart_disease = pd.read_csv("heart-disease.csv")
Then, let's take a look at the first rows:
print(heart_disease.head())
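It is also a good habit to check the size of the dataset and its column types before modelling; a quick sanity check:
# How many rows and columns do we have?
print(heart_disease.shape)
# Column names, data types and non-null counts
heart_disease.info()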
To train a supervised machine learning model, we need an X (the data matrix, often called the feature matrix) and a y (the labels). Here, we choose the target column as the label (the target indicates whether or not the patient has heart disease) and the remaining columns as X. So let's prepare our data:
X = heart_disease.drop("target", axis=1)
X.head()
So here, we kept all the columns of our dataframe except target. We set axis=1 to tell pandas to drop a column (axis=0 would drop rows).
Now, let's create the label vector:
labels = heart_disease['target']
labels.head()
Now we have our two inputs: the feature matrix X and the label vector labels.
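A quick check that the split went as expected (X should have one column fewer than the original dataframe, and as many rows as there are labels):
print(X.shape)        # (number of patients, number of features)
print(labels.shape)   # (number of patients,)
assert len(X) == len(labels)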
Before feeding our data into our model, let's start by visualizing it. For example, let's plot cholesterol as a function of age.
## Explore our data
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(14, 10))
ax.grid()

# One bar per age value : age on the x-axis, cholesterol on the y-axis
ax.bar(heart_disease['age'], heart_disease['chol'])
ax.set(title="Cholesterol by age",
       xlabel="age",
       ylabel="Cholesterol")

# Horizontal line marking the mean cholesterol level, with a label on top
ax.axhline(heart_disease['chol'].mean(), color="orange", linewidth=4, label="Mean")
ax.text(40, heart_disease['chol'].mean(),
        'Mean : {:.2f}'.format(heart_disease['chol'].mean()),
        fontsize=15, va='center', ha='center', backgroundcolor='w')

fig.savefig("cholesterol.png")
Pick a model
Let's choose a model from the scikit-learn library. We will use the Random Forest classifier, so let's import it and look at its default parameters:
## Choose the right model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
## Keep the default hyperparameters
clf.get_params()
The output is:
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
These are the hyperparameters of our model. We will study them in detail in future articles, but for now, let's focus on our problem.
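If you ever want to override one of these defaults, you pass it when constructing the model; a quick illustration (the values below are arbitrary, and we keep the defaults in the rest of this tutorial):
# Override a few defaults at construction time (arbitrary example values)
clf_custom = RandomForestClassifier(n_estimators=200, max_depth=5)
print(clf_custom.get_params()["n_estimators"])   # -> 200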
We want to keep some data aside for testing, while the majority of our dataset is devoted to training. This held-out data is essential for evaluating the model: sometimes a model performs very well on the training dataset and very poorly on the test dataset, so we cannot rely on the training dataset alone to judge it.
Therefore, we split our dataset: 80% will be used for training and 20% for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
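Note that train_test_split shuffles the data randomly, so every run produces a different split and slightly different scores. If you want a reproducible split, you can fix the seed (an optional variant, not used in the rest of this tutorial):
# Reproducible variant : a fixed random_state gives the same split every run
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)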
Fit the model
Now, let's fit our model to our training dataset:
clf.fit(X_train, y_train)
Let's make some predictions. To predict the label of a single example, let's create a random array with 13 columns (one per feature) and pass it to our model:
import numpy as np

# Make a prediction on a single random example (1 row, 13 features)
x_01 = np.random.rand(1, 13)
y_label = clf.predict(x_01)
print(y_label)
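Note that random values in [0, 1) do not respect the scale of the real features (age, cholesterol level, and so on), so this prediction is meaningless; it only shows that the API works. A more realistic single prediction takes an actual row from the test set (a small sketch using the variables defined above):
# Predict a single real patient from the test set
sample = X_test.iloc[[0]]    # double brackets keep it 2-dimensional : (1, 13)
print(clf.predict(sample))   # predicted class for this patient
print(y_test.iloc[0])        # the true label, for comparison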
Perfect! Now let's evaluate our model on the test dataset.
Evaluate the model
First, let's predict the labels of the test set:
y_pred = clf.predict(X_test)
y_pred
The output looks something like this:
array([1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1], dtype=int64)
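Besides hard class predictions, scikit-learn classifiers can also return a probability per class through predict_proba; a quick illustration on the first five test rows (here column 0 should correspond to class 0, no disease, and column 1 to class 1, disease):
# Probability of each class for the first five test patients
print(clf.predict_proba(X_test[:5]))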
Now, let's evaluate the model on the training data:
clf.score(X_train, y_train)
The output is:
1.0
So the model has an accuracy of 100% on the training dataset.
But what is accuracy?
Accuracy measures the proportion of observations, both positive and negative, that were correctly classified:
ACC = (TP + TN) / (TP + TN + FP + FN)
where:
TP = True positive: the model predicts the positive class and is correct
FP = False positive: the model predicts the positive class and is wrong
TN = True negative: the model predicts the negative class and is correct
FN = False negative: the model predicts the negative class and is wrong
clf.score(X_test, y_test)
0.8360655737704918
Excellent! We reached 83.6% accuracy on the test dataset!
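We can verify this number against the formula above by counting the four cases ourselves; a small sanity check using scikit-learn's confusion_matrix and accuracy_score:
from sklearn.metrics import accuracy_score, confusion_matrix

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print((tp + tn) / (tp + tn + fp + fn))   # the accuracy formula, by hand
print(accuracy_score(y_test, y_pred))    # the same value, computed by scikit-learn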
We can also look at more metrics than accuracy. For that, let's import classification_report and confusion_matrix:
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))
Let's have a look at the confusion matrix as well:
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix of our classifier on the test set
# (ConfusionMatrixDisplay replaces plot_confusion_matrix, which was
# removed in recent scikit-learn versions)
fig, ax = plt.subplots(figsize=(10, 10))
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, ax=ax)
For this tutorial, let's keep it simple. In the next articles, we will see in detail what these metrics are and how to interpret them.
Improve the model
A simple experiment is to vary the number of trees in the forest (the n_estimators hyperparameter) and watch how the test accuracy changes:
import numpy as np
np.random.seed(42)

for i in range(10, 100, 10):
    print(f"Trying with {i} estimators..")
    clf = RandomForestClassifier(n_estimators=i)
    clf.fit(X_train, y_train)
    print(f"Model accuracy on test set : {clf.score(X_test, y_test) * 100}")
    print(" ")
Trying with 10 estimators..
Model accuracy on test set : 77.04918032786885
Trying with 20 estimators..
Model accuracy on test set : 81.9672131147541
Trying with 30 estimators..
Model accuracy on test set : 80.32786885245902
Trying with 40 estimators..
Model accuracy on test set : 81.9672131147541
Trying with 50 estimators..
Model accuracy on test set : 81.9672131147541
Trying with 60 estimators..
Model accuracy on test set : 81.9672131147541
Trying with 70 estimators..
Model accuracy on test set : 86.88524590163934
Trying with 80 estimators..
Model accuracy on test set : 83.60655737704919
Trying with 90 estimators..
Model accuracy on test set : 86.88524590163934
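Trying values in a loop works, but the score on a single test split is noisy. A more systematic approach is a cross-validated grid search; a minimal sketch using GridSearchCV (the parameter grid here is just an example):
from sklearn.model_selection import GridSearchCV

# 5-fold cross-validated search over n_estimators on the training set
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={"n_estimators": [10, 50, 100]},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)            # best value found by cross-validation
print(grid.score(X_test, y_test))   # accuracy of the best model on the test set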
Save and load the model
To save the model, we will use the pickle library:
import pickle
pickle.dump(clf, open("Random_forest_model.pkl", "wb"))
To load a model, we'll use the pickle.load function:
loaded_model = pickle.load(open("Random_forest_model.pkl", "rb"))
loaded_model.score(X_test, y_test)
0.8688524590163934
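For scikit-learn models, the joblib library (installed as a scikit-learn dependency) is a common alternative to pickle, since it handles the large NumPy arrays inside a model efficiently; a minimal sketch:
import joblib

# Save and reload the trained model with joblib
joblib.dump(clf, "Random_forest_model.joblib")
loaded_model = joblib.load("Random_forest_model.joblib")
print(loaded_model.score(X_test, y_test))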
Finally...
This blog post is intended to give you a quick overview of how to create your own model and train it on a real-world dataset. In future posts, we will break down the algorithms used, such as the random forest, discuss the choice of its hyperparameters, and look in detail at the evaluation metrics that can be used. Thanks to all, and stay tuned!