Tutorial 2 : Natural Language Processing (Regular expressions and stemming)

Let's learn the basics of NLP


Introduction

In this tutorial, we will learn how regular expressions work and what stemming is. Then we will implement these concepts in Python using the re and nltk libraries.

Installation

Before we start, let's check that the libraries are installed. The re module ships with Python's standard library, so only nltk needs to be installed:

pip install nltk

Regular expressions

A regular expression is a string that describes, according to a precise syntax, a set of possible strings.

Consider the following list of strings: ["hello world", "what", "are", "you", "doing"]. If we want to get the entries starting with w, we define our regular expression as follows:

import re

# compile a pattern that matches strings starting with "w"
regex = re.compile(r"^w")

Here, ^ matches the start of the string, or the start of each line when the re.MULTILINE flag is set. Then we will define our test data:

test_phrase = ["hello world", "what", "are", "you", "doing"]

Then we will define an empty list called matches in which we will store the entries that match our regular expression:

matches = []
for phrase in test_phrase:
    if regex.match(phrase):
        matches.append(phrase)

Finally, if we print matches:

print(matches)
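This prints ['what']: it is the only entry that starts with w. Note that "hello world" contains world but begins with h, so ^w does not match it.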

For more details on regular expressions, have a look at this handy sheet: CheatSheet, or visit this website: regex.
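Since ^ came up above, here is a minimal sketch (the text variable is purely illustrative) contrasting its default behaviour with re.MULTILINE:

```
import re

text = "hello\nworld"

# by default, ^ only matches at the very start of the whole string
print(re.findall(r"^\w+", text))                # ['hello']

# with re.MULTILINE, ^ also matches right after each newline
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['hello', 'world']
```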

Stemming

We will now learn how stemming works. In the example in [Tutorial 0](hamzaelyousfi.hashnode.dev/tutorial-0-intro..), we used the words in their raw form, which means that for the model, the words BOOKS and BOOK have nothing to do with each other. Stemming reduces each word to its radical (its stem), which means that both BOOKS and BOOK will be represented by their radical BOOK.
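As a quick preview (a minimal sketch; it assumes nltk is already installed), the Porter stemmer we use below maps both forms to the same stem:

```
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# stem() lowercases its input by default, so both forms collapse to "book"
print(stemmer.stem("BOOKS"))  # book
print(stemmer.stem("BOOK"))   # book
```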

In Python, let's import the nltk library:

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# word_tokenize relies on the punkt tokenizer models; download them once
nltk.download('punkt')

Then let's define our stemmer:

stemmer = PorterStemmer()
phrase = "reading the books."

Then let's tokenize the phrase and extract the stemmed words:

```
# split the phrase into individual tokens
words = word_tokenize(phrase)
print(words)

# stem each token and collect the results
stemmed_words = []
for word in words:
    stemmed_words.append(stemmer.stem(word))

print(" ".join(stemmed_words))
```

Finally ...

In the next tutorials, we will look at lemmatization, stop words and other concepts, so stay tuned. Thanks for reading!
