
Stopwords and Filtering in Natural Language Processing

Stopwords and Filtering is the next step in NLP pre-processing after Tokenization. If we consider the same example from the previous blog on Tokenization, we can see that many tokens are rather irrelevant. As a result, we need to filter out these irrelevant tokens and keep only the useful information.

Stopwords

Stop words are the common words of a language. They carry little information on their own but are necessary to make a sentence grammatically complete. Some examples of stop words are “in”, “the”, “is”, “an”, etc. We can safely ignore these words without losing the meaning of the content.

Let’s consider the previous example:

["This", "is", "an", "example", "text", "for", "word", "tokenization", ".", "Word", "tokenization", "split", "'s", "texts", "into", "individual", "words", "."]

Here, words like “is”, “an” and “for” add no value to the sentence.

While stop words are, by definition, “common words”, there is no universal set of stop words: every tool provides a different set. However, the major stop words are covered in all the sets.
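For instance, we can compare the lists shipped by NLTK, spaCy and scikit-learn (a minimal sketch, assuming all three libraries are installed and NLTK's stopwords corpus has been downloaded, which is covered below; the exact counts vary between versions):

from nltk.corpus import stopwords
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stopwords

nltk_stopwords = set(stopwords.words("english"))

# Each library ships a different number of stop words...
print(len(nltk_stopwords), len(spacy_stopwords), len(sklearn_stopwords))

# ...but the major ones appear in all three sets
print(sorted(nltk_stopwords & set(spacy_stopwords) & set(sklearn_stopwords))[:10])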

Since stop words carry little information, we can ignore them when training deep learning models for classification.
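For example, scikit-learn's CountVectorizer can drop stop words while building bag-of-words features for a classifier (a minimal sketch; the two sample sentences are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

# stop_words="english" applies scikit-learn's built-in English stop word list
vectorizer = CountVectorizer(stop_words="english")
features = vectorizer.fit_transform(["This is an example text.", "The text is short."])

# Only the informative words survive as features
print(vectorizer.get_feature_names_out())  # ['example' 'short' 'text']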

However, one must note that stop word removal is not recommended for Machine Translation and Summarization tasks, where every word is needed to produce fluent output.

Stopwords in NLTK

NLTK is one of the tools that provide a downloadable corpus of stop words. Before we begin, we need to download the stopwords corpus. To do so, run the following in a Python shell:

import nltk
nltk.download("stopwords")

Once the download is successful, we can check the stop words provided by NLTK. At the time of writing, NLTK's English list contains 179 stop words.

To get the list of all the stop words:

from nltk.corpus import stopwords
print(stopwords.words("english"))
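To verify the count (a quick check; the number may change between NLTK releases):

print(len(stopwords.words("english")))  # 179 at the time of writing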

Some of the stop words from the list:

["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "you're", "you've", "you'll", "you'd", "your", "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "she's", "her", "hers", "herself"]

Filtering

Now that we know what stop words are, we can filter them out of a given sentence. Filtering is the process of removing stop words or any other unnecessary data from the text. We can easily filter stop words using Python. For this purpose, we consider a different but similar example.

import nltk

# word_tokenize requires the "punkt" tokenizer models;
# run nltk.download("punkt") once if they are missing
text = "This is an example text for stopword removal and filtering. This is done using NLTK's stopwords."
words = nltk.word_tokenize(text)

print("Unfiltered: ", words)
stopwords = nltk.corpus.stopwords.words("english")

# Keep only the tokens that are not in the stop word list
cleaned = [word for word in words if word not in stopwords]
print("Filtered: ", cleaned)

Output

Unfiltered:  ["This", "is", "an", "example", "text", "for", "stopword", "removal", "and", "filtering", ".", "This", "is", "done", "using", "NLTK", "'s", "stopwords", "."]
Filtered:  ["This", "example", "text", "stopword", "removal", "filtering", ".", "This", "done", "using", "NLTK", "'s", "stopwords", "."]

Although we got rid of the stop words, we see that we still have punctuation. To handle this, we can extend the list of stop words to contain all the punctuation characters from string.punctuation. Similarly, we can modify the stopwords list as per the application to include or exclude words of our choice, as shown in the sketch below.
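For instance (a minimal sketch; the words added and removed here are arbitrary choices for illustration):

from nltk.corpus import stopwords

custom_stopwords = stopwords.words("english")
custom_stopwords.append("text")   # treat a domain-specific word as a stop word
custom_stopwords.remove("not")    # keep negations, e.g. for sentiment analysis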

Moreover, the token “‘s” is not providing any information. A crude way to tackle this issue is to remove all tokens shorter than 3 characters. Similarly, other filters can be applied to extract the most information from the text.

Additionally, we see that some words are capitalized. Hence, to avoid treating the same word as 2 different strings, we convert all tokens into lowercase.

import nltk
import string

text = "This is an example text for stopword removal and filtering. This is done using NLTK's stopwords."
words = nltk.word_tokenize(text)
stopwords = nltk.corpus.stopwords.words("english")

# Extend the stop word list with all punctuation characters
stopwords.extend(string.punctuation)

# Remove stop words and tokens shorter than 3 characters, lowercasing the rest
cleaned = [word.lower() for word in words if (word not in stopwords) and len(word) > 2]
print(cleaned)

Final Output

["this", "example", "text", "stopword", "removal", "filtering", "this", "done", "using", "nltk", "stopwords"]

Thus, we used stop words and filtering to pre-process our text, keeping only the essential tokens and ignoring the irrelevant information.

Further in the series: Stemming and Lemmatization in Natural Language Processing

