Tutorial 0 : Introduction to NLP (Natural Language Processing)
A tutorial for beginners, bag of words and support vector machines.
Introduction
In this tutorial, we will try to create a simple text classifier following these steps :
Understand how the bag of words model works.
Build a small text dataset.
Convert this text dataset into numbers (to feed them to the classifier).
Train a machine learning model (an SVM) to classify the texts into two categories.
Requirements
Basic Python knowledge.
Python 3 with scikit-learn (sklearn) installed.
How Does the Bag of Words Model Work?
The bag-of-words model represents a text (such as a sentence or a document) as the bag of its words: grammar and word order are ignored, and only the number of times each word occurs is kept.
To understand how the bag of words model works, let's take a simple example and consider the following texts:
"I love the book"
"this is a great book"
"The fit is great"
"I love the shoes"
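The principle of the bag-of-words model is to build a vocabulary: the list of the N distinct words (feature names) that appear in the texts above. Here N = 8. The one-letter words "I" and "a" are left out because scikit-learn's default tokenizer, which we use later in this tutorial, only keeps words of two or more characters: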
- ['book', 'fit', 'great', 'is', 'love', 'shoes', 'the', 'this']
Then, each sentence is represented by one row of N columns, where each column contains the number of occurrences of the corresponding vocabulary word in that sentence:
[1 0 0 0 1 0 1 0]
[1 0 1 1 0 0 0 1]
[0 1 1 1 0 0 1 0]
[0 0 0 0 1 1 1 0]
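To make the counting concrete, here is a minimal pure-Python sketch of the same idea. It is not how scikit-learn is implemented; the `tokenize` and `to_vector` helpers are hypothetical names introduced for illustration, and the sentences are chosen to match the vocabulary and count vectors listed above. The tokenizer assumes the same rule as scikit-learn's default (lowercase, keep words of two or more characters).

```python
import re
from collections import Counter

texts = [
    "I love the book",
    "this is a great book",
    "The fit is great",
    "I love the shoes",
]


def tokenize(text):
    # Lowercase the text, then keep runs of two or more word characters,
    # so one-letter words such as "I" and "a" are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [tokenize(t) for t in texts]

# The vocabulary is the sorted list of distinct words across all texts.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def to_vector(tokens):
    # Count each vocabulary word in one tokenized sentence.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


vectors = [to_vector(tokens) for tokens in tokenized]

print(vocabulary)
for row in vectors:
    print(row)
```

Each printed row has one entry per vocabulary word, in the same order as the vocabulary list.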
Build the classifier using Python and sklearn
Objective :
The objective is to build a model that takes a sentence as input and classifies it as BOOKS or CARS.
Bag of words in Python
Now that we understand how the bag-of-words model works, let's code it in Python. For this we need the CountVectorizer class from sklearn.feature_extraction.text:
from sklearn.feature_extraction.text import CountVectorizer
Now we will define our "BOOKS" and "CARS" categories :
class Category:
    BOOKS = "BOOKS"
    CARS = "CARS"
Next, we create a simple training dataset made of sentences and their corresponding labels. For example, "I love the book" corresponds to the category BOOKS.
train_x = [
    "I love the book",
    "this is a great book",
    "The cover of this novel is good",
    "this car is red",
    "I like the colour of this car",
    "This vehicle is amazing !",
]
train_y = [
    Category.BOOKS,
    Category.BOOKS,
    Category.BOOKS,
    Category.CARS,
    Category.CARS,
    Category.CARS,
]
Now, let's create the bag-of-words vectorizer:
vectorizer = CountVectorizer()
Then we fit it to our training dataset and convert the sentences into count vectors:
vectors = vectorizer.fit_transform(train_x)
Support Vector Machines Model
Now that we have converted our text into vectors, let's train a machine learning model to classify these vectors. For this we choose the Support Vector Classifier (SVC) from the sklearn library, with a linear kernel:
from sklearn import svm
clf_svm = svm.SVC(kernel='linear')
clf_svm.fit(vectors, train_y)
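Now that we have trained our model, let's test it. We first define a new sentence, convert it with the same fitted vectorizer (using transform, not fit_transform, so that the vocabulary stays unchanged), and then ask the classifier for its prediction:
text = "I like the book"
# Reuse the fitted vectorizer so the columns match the training vectors
test_x = vectorizer.transform([text])
# predict returns an array with one label per input sentence
print(clf_svm.predict(test_x))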
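The vectorizer and the classifier always have to be used together, in the same order. As an optional convenience (this is not the approach used in the steps above, just an alternative packaging of them), scikit-learn's `make_pipeline` can chain the two into a single object, so that raw sentences go in and labels come out. The sketch below is self-contained; the second test sentence and the variable names `model`, `new_sentences` and `predictions` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

train_x = [
    "I love the book",
    "this is a great book",
    "The cover of this novel is good",
    "this car is red",
    "I like the colour of this car",
    "This vehicle is amazing !",
]
train_y = ["BOOKS", "BOOKS", "BOOKS", "CARS", "CARS", "CARS"]

# Chain the bag-of-words step and the linear SVM into one estimator.
model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))

# fit() vectorizes the sentences, then trains the classifier on the vectors.
model.fit(train_x, train_y)

# predict() applies the already fitted vocabulary before classifying.
new_sentences = ["I like the book", "this car is great"]
predictions = model.predict(new_sentences)

for sentence, label in zip(new_sentences, predictions):
    print(f"{sentence!r} -> {label}")
```

With only six training sentences the predictions are fragile, so treat the printed labels as an illustration rather than a reliable result.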
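Finally...
This was a very simple introduction to NLP. Note that, for this model, the words "books" and "book" are not the same: each distinct spelling gets its own column, and a word that never appeared in the training sentences is simply ignored at prediction time. We will try to solve this kind of problem in the next tutorials, where we will also cover other concepts. Thanks!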
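The "book" versus "books" limitation is easy to check. The following sketch fits a CountVectorizer on the same training sentences and compares how the two words are encoded; the names `book_vec` and `books_vec` are introduced here for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_x = [
    "I love the book",
    "this is a great book",
    "The cover of this novel is good",
    "this car is red",
    "I like the colour of this car",
    "This vehicle is amazing !",
]

vectorizer = CountVectorizer()
vectorizer.fit(train_x)

# "book" was seen during fitting, "books" was not.
print("book" in vectorizer.vocabulary_)
print("books" in vectorizer.vocabulary_)

# Unknown words are skipped by transform, so "books" becomes an all-zero vector.
book_vec = vectorizer.transform(["book"]).toarray()[0]
books_vec = vectorizer.transform(["books"]).toarray()[0]
print(book_vec.sum(), books_vec.sum())
```

Because "books" maps to a vector of zeros, the classifier receives no information from it at all.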