Tutorial 1: NLP (Natural Language Processing) with spaCy

How to use the en_core_web_md model.


Introduction

In the previous tutorial, we trained an SVM on top of a bag-of-words model. One limitation we saw is that bag of words does not capture the semantic value of a word: under bag of words, the words "paper" and "book" have nothing to do with each other. In this tutorial, we will try to solve this problem by using another model from the spaCy library.
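To see this limitation concretely, here is a minimal sketch (using a toy one-hot bag-of-words encoding over a made-up three-word vocabulary, not spaCy) in which "paper" and "book" come out with zero cosine similarity:

```python
import math

# Toy vocabulary for a bag-of-words model: each word gets its own dimension.
vocab = ["paper", "book", "car"]

def bow_vector(word):
    # One-hot bag-of-words encoding: 1 in the word's own dimension, 0 elsewhere.
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# "paper" and "book" share no dimension, so their similarity is exactly 0,
# even though the two words are semantically related.
print(cosine(bow_vector("paper"), bow_vector("book")))  # → 0.0
```

Dense word vectors, like the ones en_core_web_md provides, are trained so that related words end up with high cosine similarity instead of zero.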

Installation

Before we start, we need to install spaCy, a Python library for natural language processing. We also need en_core_web_md, a medium-sized English model trained on written web text:

```
!pip install spacy
!python -m spacy download en_core_web_md
```

Build the model

Import dependencies

Now that our library is installed, we are ready to import spaCy and load the model:

```python
import spacy

nlp = spacy.load("en_core_web_md")
```

Create the dataset

Let's take the same dataset as in the previous article:

```python
class Category:
    BOOKS = "BOOKS"
    CARS = "CARS"

train_x = [
        "I love the book",
        "this is a great book",
        "The cover of this novel is good",
        "this car is red",
        "I like the colour of this car",
        "This vehicle is amazing !"]

train_y = [
        Category.BOOKS,
        Category.BOOKS,
        Category.BOOKS,
        Category.CARS,
        Category.CARS,
        Category.CARS]
```

Convert to docs

Now, we will convert our texts to "docs" :

```python
docs = [nlp(text) for text in train_x]
print(docs)
```

When we print these docs, the output simply shows the sentences back, since a Doc prints as its text.

The docs may look like a simple list of sentences; however, each Doc exposes many features. For example, to get the vector corresponding to one of the sentences, we can use the following command:

```python
print(docs[0].vector)
```

We will then get something like this:

```
[-7.33089983e-01 -5.24749886e-03 -2.35488251e-01  1.59274936e-02
  9.66347754e-02  1.56278491e-01  1.38615012e-01 -1.82292491e-01
  8.84527490e-02  1.54077005e+00 -2.41762251e-01 -8.96672532e-02
 ...
 -5.27722538e-02  1.50995508e-01  3.53277624e-02 -1.23240001e-01
  1.14262253e-01  1.09529994e-01]
```
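As a side note, for a model like en_core_web_md the vector of a Doc is, by default, the average of its token vectors. A minimal sketch of that averaging, using tiny made-up 3-dimensional token vectors instead of spaCy's real 300-dimensional ones:

```python
# Sketch only (not spaCy itself): a Doc vector as the mean of its token vectors.
# The token vectors below are hypothetical 3-d values for illustration.
token_vectors = [
    [0.2, -0.1, 0.5],   # hypothetical vector for "I"
    [0.4,  0.3, -0.2],  # hypothetical vector for "love"
    [0.0,  0.6, 0.1],   # hypothetical vector for "the"
    [0.6, -0.4, 0.2],   # hypothetical vector for "book"
]

n = len(token_vectors)
# Element-wise mean over the tokens gives one fixed-size sentence vector.
doc_vector = [sum(v[i] for v in token_vectors) / n for i in range(3)]
print(doc_vector)
```

This is why every sentence, whatever its length, ends up as a vector of the same size, which is exactly what a classifier needs.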

Now, to train a classifier on this data, we will build a dataset containing the vector of each sentence. To do this, we will use a list comprehension:

```python
train_x_v = [x.vector for x in docs]
```

Build the classifier

Then we will train a Support Vector Machine classifier:

```python
from sklearn import svm

clf_svm_wv = svm.SVC(kernel='linear')
clf_svm_wv.fit(train_x_v, train_y)
```
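As a sanity check of this fit/predict pattern, here is the same kind of linear SVM on tiny hand-made 2-dimensional "sentence vectors" (hypothetical numbers, not real spaCy vectors), assuming scikit-learn is installed:

```python
from sklearn import svm

# Hypothetical 2-d sentence vectors: BOOKS cluster on the left,
# CARS cluster on the right of the plane.
train_x_v = [[-1.0, 0.5], [-0.8, 0.3], [-1.2, 0.4],
             [1.0, -0.5], [0.9, -0.3], [1.1, -0.4]]
train_y = ["BOOKS", "BOOKS", "BOOKS", "CARS", "CARS", "CARS"]

clf = svm.SVC(kernel='linear')  # linear kernel, as in the tutorial
clf.fit(train_x_v, train_y)

# A new point near the BOOKS cluster should be labelled BOOKS.
print(clf.predict([[-0.9, 0.4]]))  # → ['BOOKS']
```

With real spaCy vectors the idea is identical, only the dimensionality changes (300 instead of 2).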

Now, let's test our classifier on a text like "This is a paper", and hopefully the model will classify this text as BOOKS:

```python
test = ["This is a paper"]
test_docs = [nlp(text) for text in test]
test_x_vectors = [x.vector for x in test_docs]
clf_svm_wv.predict(test_x_vectors)
```

```
[array(['BOOKS'], dtype='<U5')]
```

Thank you, and see you in the next tutorial !
