Tutorial 2 : Natural Language Processing (Regular expressions and stemming)

Let's learn the basics of NLP


Introduction

In this tutorial, we will learn how regular expressions work and what stemming is. Then we will implement these concepts in Python using the re and nltk libraries.

Installation

Before we start, let's check that the libraries are installed. The re module ships with Python's standard library, so only nltk needs to be installed:

pip install nltk

Regular expressions

A regular expression is a string that describes, according to a precise syntax, a set of possible strings.

Consider the following list of strings: ["hello world", "what", "are", "you", "doing"]. If we want to get the entries starting with w, we define our regular expression as follows:

import re

# compile a pattern that matches strings starting with "w"
regex = re.compile(r"^w")

Here, ^ matches the start of the string, or the start of each line when the re.MULTILINE flag is set. Then we will define our test data:

test_phrase = ["hello world", "what", "are", "you", "doing"]

Then we will define an empty list called matches in which we will store the entries that match our regular expression:

matches = []
for phrase in test_phrase:
    if regex.match(phrase):
        matches.append(phrase)

Finally, if we print matches:

print(matches)
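This prints ['what']: it is the only entry that starts with w. Note that "hello world" contains world but begins with h, so ^w does not match it.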

For more details on regular expressions, have a look at this handy sheet: CheatSheet, or visit this website: regex.
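Since ^ came up above, here is a minimal sketch (the text variable is purely illustrative) contrasting its default behaviour with re.MULTILINE:

```
import re

text = "hello\nworld"

# by default, ^ only matches at the very start of the whole string
print(re.findall(r"^\w+", text))                # ['hello']

# with re.MULTILINE, ^ also matches right after each newline
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['hello', 'world']
```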

Stemming

We will now learn how stemming works. In the example in [Tutorial 0](hamzaelyousfi.hashnode.dev/tutorial-0-intro..), we used the words in their raw form, which means that for the model, the words BOOKS and BOOK have nothing to do with each other. Stemming reduces each word to its radical (its stem), which means that both BOOKS and BOOK will be represented by their radical BOOK.
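As a quick preview (a minimal sketch; it assumes nltk is already installed), the Porter stemmer we use below maps both forms to the same stem:

```
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# stem() lowercases its input by default, so both forms collapse to "book"
print(stemmer.stem("BOOKS"))  # book
print(stemmer.stem("BOOK"))   # book
```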

In Python, let's import the nltk library:

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

# word_tokenize relies on the punkt tokenizer models; download them once
nltk.download('punkt')

Then let's define our stemmer:

stemmer = PorterStemmer()
phrase = "reading the books."

Then let's tokenize the phrase and extract the stemmed words:

```
# split the phrase into individual tokens
words = word_tokenize(phrase)
print(words)

# stem each token and collect the results
stemmed_words = []
for word in words:
    stemmed_words.append(stemmer.stem(word))

print(" ".join(stemmed_words))
```

Finally ...

In the next tutorials, we will look at lemmatization, stop words and other concepts, so stay tuned. Thanks for reading!
