Machine Learning: Introduction to Scikit-Learn. Learn how to create your first machine learning model!

Create and train a Random Forest classifier on a real heart disease dataset with Python and the Scikit-learn library.

Introduction

Are you interested in machine learning but don't know where to start? This blog post is for you. In this tutorial, I will guide you step by step through building your first machine learning model with the scikit-learn library. Entire books have been written about this great library, but by the end of this tutorial you will know what Scikit-learn is and how to create your first machine learning model with it. We will train that model on a real heart disease dataset, so stay tuned!

Requirements

  • Basic knowledge of Python.

  • Python with scikit-learn, pandas, and matplotlib installed.

What is Scikit-learn?

Scikit-learn is a machine learning library for Python. If we have data, Scikit-learn provides us with tools to build models that learn patterns from that data and perform tasks such as classification, regression, and clustering.

Classification assigns objects to predefined classes. Clustering identifies similarities between objects and groups them according to these common characteristics. Regression attempts to determine the strength and character of the relationship between a dependent variable (usually denoted by y) and a series of independent variables.

Scikit-learn also implements tools to help us evaluate these predictions and thus assess the performance of our model.
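To make this concrete, here is a tiny sketch (with made-up toy data, not from our heart disease problem) showing that scikit-learn exposes estimators for all three task types behind the same fit/predict interface:

# A tiny, made-up example: three task types, one shared fit/predict interface
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])               # predefined classes -> classification
y_reg = np.array([1.1, 2.0, 3.2, 9.8, 11.1, 12.2])   # continuous target -> regression

print(LogisticRegression().fit(X_toy, y_class).predict([[2.5]]))  # predicts a class label
print(LinearRegression().fit(X_toy, y_reg).predict([[2.5]]))      # predicts a continuous value
print(KMeans(n_clusters=2, n_init=10).fit_predict(X_toy))         # groups samples without labels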

Why Scikit-learn?

  • Built on NumPy, SciPy, and Matplotlib.

  • Has many built-in machine learning models.

  • Methods to evaluate machine learning models.

  • Very well-designed API.

Steps of our workflow:

  1. Collect the data.

  2. Pick a model.

  3. Fit the model to the data and make predictions.

  4. Evaluate the model.

  5. Improve through experimentation.

  6. Save and reload your trained model.


Where to get help?

If you don't know what a function does, you can get help from the official Scikit-learn documentation, from Python's built-in help() function, or by pressing Shift+Tab inside a Jupyter notebook.
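For example, a quick sketch:

from sklearn.ensemble import RandomForestClassifier

help(RandomForestClassifier.fit)   # prints the docstring in any Python shell
# In a Jupyter notebook you can also run: RandomForestClassifier.fit?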

Collect the data

Let's use the heart disease dataset from Kaggle: download the heart disease dataset and save it as heart-disease.csv.
First, let's load our dataset:

import pandas as pd
heart_disease = pd.read_csv("heart-disease.csv")

Then, let's take a look at the first few rows:

print(heart_disease.head())

[Output: first five rows of the heart disease dataset]

To train a supervised machine learning model, we need an X (the data matrix, often called the feature matrix) and a y (the labels). Here, we choose the target column as the label (the target indicates whether the patient has heart disease or not) and the remaining columns as X. So let's prepare our data:

X = heart_disease.drop("target", axis=1)
X.head()

[Output: first five rows of X, the features without the target column]

So here, we kept all the columns of our DataFrame except target. We set axis=1 so that drop removes a column rather than a row.
Now, let's create the labels:

labels = heart_disease['target']
labels.head()

Now we have our feature matrix X and our label vector labels.
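As a quick sanity check (a small sketch, not in the original workflow), we can confirm that X and labels have matching numbers of rows and that X has 13 feature columns:

# X should be (number of patients, 13) and labels (number of patients,)
print(X.shape)
print(labels.shape)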

Before feeding the data into our model, let's start by visualizing it. For example, let's plot cholesterol as a function of age.

## Explore our data
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(14, 10))
ax.grid()
# Bar chart of cholesterol against age
ax.bar(heart_disease['age'], heart_disease['chol'])
ax.set(title="Cholesterol by age",
       xlabel="age",
       ylabel="Cholesterol")
# Horizontal line showing the mean cholesterol, with a text annotation
mean_chol = heart_disease['chol'].mean()
ax.axhline(mean_chol, color="orange", linewidth=4, label="Mean")
ax.text(40, mean_chol, 'Mean : {:.2f}'.format(mean_chol),
        fontsize=15, va='center', ha='center', backgroundcolor='w')
ax.legend()
fig.savefig("cholesterol.png")

[Figure: Cholesterol by age, with the mean cholesterol shown as an orange horizontal line]

Pick a model

Let's choose a model from the Scikit-learn library. We will use the Random Forest classifier, so let's import it and look at its default parameters:

## Choose the right model
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

## Keep the default hyperparameters
clf.get_params()

The output is:

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}

These are the hyperparameters of our model. We will study them in detail in future articles, but for now, let's focus on our problem.

We want to keep some data aside for testing and devote the majority of our dataset to training. This split is essential for evaluating the model: without it, we cannot know whether the model performs well on data it has never seen. In fact, a model sometimes performs very well on the training dataset and very poorly on the test dataset, so we cannot rely on the training dataset alone to evaluate it.
Therefore, we split our dataset as shown in the next figure:

[Figure: splitting the dataset into 80% training samples and 20% test samples]

80% of our dataset will be used for training and 20% for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
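Note that train_test_split shuffles the rows randomly, so every run gives a slightly different split and slightly different scores. If you want a reproducible split, you can pass a random_state; a small sketch (the value 42 is arbitrary):

# Fix the random seed of the split so the same rows land in train/test every run
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)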

Fit the model

Now, let's fit our model on our training dataset:

clf.fit(X_train, y_train)

Now let's make some predictions. To predict on a single sample, let's create a random array with 13 columns (one per feature) and ask our model for its label:

import numpy as np

# Make a prediction on a single random sample (1 row, 13 feature columns)
x_01 = np.random.rand(1, 13)
y_label = clf.predict(x_01)
print(y_label)
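A random vector is only a placeholder, of course. To predict for a real patient, we can take a row from the test set instead, and also ask for class probabilities (a small sketch):

# Predict for one real patient from the test set (double brackets keep it 2-D)
sample = X_test.iloc[[0]]
print(clf.predict(sample))        # predicted target value (0 or 1)
print(clf.predict_proba(sample))  # estimated probability of each class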

Perfect! Now let's evaluate our model on the test dataset.

Evaluate the model

First, let's make predictions on the X_test dataset:

y_pred = clf.predict(X_test)
y_pred

The output looks something like this:

array([1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1], dtype=int64)

Now, let's evaluate the model on the training data:

clf.score(X_train, y_train)

The output is:

1.0

So the model has an accuracy of 100% on the training dataset, which is exactly why we cannot judge it on the training data alone.
But what is accuracy?

Accuracy measures the proportion of observations, both positive and negative, that were correctly classified:

ACC = (TP + TN) / (TP + TN + FP + FN)

where:
TP = True positive: the model predicts the positive class and the prediction is correct.
FP = False positive: the model predicts the positive class but the prediction is wrong.
TN = True negative: the model predicts the negative class and the prediction is correct.
FN = False negative: the model predicts the negative class but the prediction is wrong.
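To make the formula concrete, here is a small sketch that recomputes the test-set accuracy from these four counts, using the y_pred predictions from above; it should give the same number as the clf.score call that follows:

from sklearn.metrics import confusion_matrix

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print((tp + tn) / (tp + tn + fp + fn))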

Now let's get the same number directly from the model:

clf.score(X_test, y_test)

0.8360655737704918

Excellent! We have reached 83.6% accuracy on the test dataset!
We can also look at more metrics than accuracy. For that, let's import classification_report and confusion_matrix:

from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(y_test, y_pred))

[Output: classification report showing precision, recall, and F1-score for each class]

Let's have a look at the confusion matrix as well (using ConfusionMatrixDisplay, which replaces the older plot_confusion_matrix helper in recent scikit-learn versions):

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix for the test set
fig, ax = plt.subplots(figsize=(10, 10))
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, ax=ax)

[Figure: confusion matrix for the test set]

For this tutorial, let's keep it simple. In the next articles, we will see in detail what these metrics are and how to interpret them.

Improving the model

One simple experiment is to vary the number of trees in the forest (the n_estimators hyperparameter) and see how the accuracy on the test set changes:

import numpy as np
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying with {i} estimators..")
    clf = RandomForestClassifier(n_estimators = i)
    clf.fit(X_train, y_train)
    print(f"Model accuracy on test set : {clf.score(X_test, y_test) * 100}")
    print(" ")

Trying with 10 estimators..
Model accuracy on test set : 77.04918032786885
Trying with 20 estimators..
Model accuracy on test set : 81.9672131147541
Trying with 30 estimators..
Model accuracy on test set : 80.32786885245902
Trying with 40 estimators..
Model accuracy on test set : 81.9672131147541
Trying with 50 estimators..
Model accuracy on test set : 81.9672131147541
Trying with 60 estimators..
Model accuracy on test set : 81.9672131147541
Trying with 70 estimators..
Model accuracy on test set : 86.88524590163934
Trying with 80 estimators..
Model accuracy on test set : 83.60655737704919
Trying with 90 estimators..
Model accuracy on test set : 86.88524590163934
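A natural follow-up (a small sketch, not part of the original walkthrough) is to keep the model that scored best instead of just the last one trained. In practice you would tune hyperparameters on a separate validation set or with cross-validation rather than on the test set, but the idea is the same:

# Keep the experiment that scored best
best_score, best_clf = 0.0, None
for i in range(10, 100, 10):
    model = RandomForestClassifier(n_estimators=i)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    if score > best_score:
        best_score, best_clf = score, model
print(f"Best test accuracy: {best_score * 100:.2f}%")
# best_clf could then be saved below instead of clf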

Save and load the model

To save the model, we will use the pickle library:

import pickle
pickle.dump(clf, open("Random_forest_model.pkl", "wb"))

To load the model back, we will use pickle's load function:

loaded_model = pickle.load(open("Random_forest_model.pkl", "rb"))
loaded_model.score(X_test, y_test)

0.8688524590163934
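As a side note (a small sketch), the scikit-learn documentation also suggests joblib for persisting models, since it handles the large NumPy arrays stored inside them efficiently:

import joblib

joblib.dump(clf, "Random_forest_model.joblib")
loaded_model = joblib.load("Random_forest_model.joblib")
print(loaded_model.score(X_test, y_test))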

Finally...

This blog post is intended to give you a quick overview of how to create your own model and train it on a real-world dataset. In future posts, we will break down the algorithms used, such as the Random Forest, discuss the choice of their hyperparameters, and look in detail at the evaluation metrics that can be used. Thanks to all! Stay tuned!
