This article explains how to remove stop words from natural languages such as English, French, and Spanish. Stop words are words that carry very little meaning on their own. For instance, English stop words include “am”, “are”, “is”, “not”, and “has”. Since stop words occur in abundance in text, they have very low classification power and can therefore be removed during the data preprocessing step.
You can manually remove stop words from a Python list if you know all the stop words in advance. However, several Python libraries exist that can remove stop words from natural languages for you. In this article, you will see how to remove stop words using the NLTK, Gensim, and SpaCy libraries. So let’s begin without further ado.
Table of Contents
- Removing Stopwords using NLTK Library
- Removing Stop Words from the English Language using NLTK
- Removing Stop Words from the Spanish Language using NLTK
- Removing Stopwords Using SpaCy Library
- Removing Stopwords Using Gensim Library
Removing Stopwords using NLTK Library
NLTK stands for Natural Language Toolkit. The NLTK library is the oldest and one of the most commonly used libraries for natural language processing (NLP). NLTK supports a variety of NLP functions such as tokenizing sentences into words, sentiment analysis, named entity recognition, part-of-speech tagging, etc. In this section, you will see how to use the NLTK library to remove stop words from text.
Before you can remove stop words using the NLTK library, you need to install it. To do so, go to your system’s command terminal and run the following command:
pip install nltk
Next, you need to import the “stopwords” sub-module from the nltk.corpus module. Run the following script:
from nltk.corpus import stopwords
Execute the following script to see the languages supported by NLTK for the removal of stop words:
print(stopwords.fileids())
In the output, you should see the list of supported languages shown below:
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'slovene', 'spanish', 'swedish', 'tajik', 'turkish']
Let’s first see all the stop words in the English language. To do so, you can use the words() method from the stopwords module and pass it the string “english” (in lowercase) as a parameter.
print(stopwords.words('english'))
The output below shows a list of all English stop words supported by the NLTK library.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Removing Stop Words from the English Language using NLTK
Let’s now see a complete example of how you can remove stop words from the English language using the NLTK library.
import nltk
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

english_sw = stopwords.words("english")
sentence = "PDF.co is a website that contains different tools to read, write and process PDF documents"
words = word_tokenize(sentence)
sentence_wo_stopwords = [word for word in words if word not in english_sw]
print(" ".join(sentence_wo_stopwords))
In the script above, the list of English stop words is stored in the “english_sw” variable. Next, the sentence from which you want to remove stop words is tokenized (broken down into a list of individual words). A list comprehension then iterates through the words in the input sentence: if a word does not exist in the “english_sw” list of stop words, it is added to the final resultant list of words; otherwise, it is ignored. Finally, the words that are not stop words are joined together using the join() method, and the resulting sentence is displayed on the console. Here is the output when you remove stop words from the sentence “PDF.co is a website that contains different tools to read, write and process PDF documents”:
PDF.co website contains different tools read , write process PDF documents
You can see that the stop words are removed from the input sentence.
Removing Stop Words from the Spanish Language using NLTK
With NLTK, you can remove stop words from other supported languages as well. For instance, the following script removes stop words from a sentence in the Spanish language:
spanish_sw = stopwords.words("spanish")
sentence = "PDF.co es un sitio web que contiene diferentes herramientas para leer, escribir y procesar documentos PDF"
words = word_tokenize(sentence)
sentence_wo_stopwords = [word for word in words if word not in spanish_sw]
print(" ".join(sentence_wo_stopwords))
After removing stop words from the Spanish sentence, “PDF.co es un sitio web que contiene diferentes herramientas para leer, escribir y procesar documentos PDF” the output sentence is as follows:
PDF.co sitio web contiene diferentes herramientas leer , escribir procesar documentos PDF
Removing Stopwords Using SpaCy Library
In addition to NLTK, you can also use Python’s SpaCy library for removing stop words. To do so, you first need to install the SpaCy library. Execute the following script on your command terminal:
pip install spacy
The SpaCy library supports various language models. You will be using the small English language model (en_core_web_sm) for stop word removal in this section. To download and install the SpaCy English language model, execute the following command on your command terminal.
python -m spacy download en_core_web_sm
To import SpaCy’s stop words into your Python application, you first have to load the SpaCy English model that you just downloaded using the load() method. Next, you can use the “Defaults.stop_words” attribute of the loaded model to access the set of stop words. The following script shows how you can remove stop words from a sentence using the SpaCy library.
import nltk
nltk.download('stopwords')
nltk.download('punkt')

from nltk.tokenize import word_tokenize
import spacy

spacy_model = spacy.load('en_core_web_sm')
english_sw = spacy_model.Defaults.stop_words
sentence = "PDF.co is a website that contains different tools to read, write and process PDF documents"
words = word_tokenize(sentence)
sentence_wo_stopwords = [word for word in words if word not in english_sw]
print(" ".join(sentence_wo_stopwords))
Here is the output of the above script:
PDF.co website contains different tools read , write process PDF documents
Removing Stopwords Using Gensim Library
Finally, you can also remove stop words using Python’s Gensim library. To install the Gensim library in Python, execute the following command on your command terminal:
pip install gensim
To remove stop words via the Gensim library, you simply have to pass the text sentence to the remove_stopwords() method of the “gensim.parsing.preprocessing” module as shown in the following script:
from gensim.parsing.preprocessing import remove_stopwords

sentence = "PDF.co is a website that contains different tools to read, write and process PDF documents"
sentence_wo_stopwords = remove_stopwords(sentence)
print(sentence_wo_stopwords)
Here is the output of the above script:
PDF.co website contains different tools read, write process PDF documents