Stemming and lemmatization are two common preprocessing tasks in natural language processing. Both attempt to reduce a word to a root form, and the two concepts are often confused with each other. This article explains the differences between stemming and lemmatization with the help of examples. It covers installing NLTK, stemming, lemmatization, and the differences between the two.
Installing NLTK
NLTK (Natural Language Toolkit) is one of the most commonly used Python libraries for natural language processing tasks. You will need to install NLTK in order to execute the examples provided in this article. To install NLTK, run the following pip command in your terminal:
pip install nltk
We will first understand stemming with the help of examples and will later see how lemmatization differs from stemming.
What is Stemming?
In natural language, different words may share the same stem. For instance, the words delete, deleted, deleting, and deletes have the same stem, i.e. delete. Stemming refers to the process of reducing a word to its stem form. Let’s see a simple example of stemming using Python’s NLTK library.
To perform stemming, you can use PorterStemmer from the NLTK library. To reduce a word to its stem form, create an object of the PorterStemmer class and then pass the word to its stem() method as shown below:
from nltk.stem import PorterStemmer

word_list = ["delete", "deleted", "deleting", "deletes"]
ps = PorterStemmer()

# Stem each word and print it next to its stem
for word in word_list:
    stem = ps.stem(word)
    print(word, "===", stem)
Output:
delete === delet
deleted === delet
deleting === delet
deletes === delet
You can see from the output that the words delete, deleted, deleting, and deletes are all reduced to the same stem, delet. Note that the stem produced by the Porter stemmer is not necessarily a dictionary word.
Similarly, you can use the PorterStemmer to reduce words in a sentence to their stem form as shown in the following example:
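import nltk
from nltk.stem import PorterStemmer

# word_tokenize needs NLTK's tokenizer data; depending on the NLTK version,
# a one-time nltk.download("punkt") or nltk.download("punkt_tab") may be required.
doc = "John was busy hence he deleted and copied the old text"
word_list = nltk.word_tokenize(doc)
ps = PorterStemmer()

for word in word_list:
    stem = ps.stem(word)
    print(word, "===", stem)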
Output:
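John === john
was === wa
busy === busi
hence === henc
he === he
deleted === delet
and === and
copied === copi
the === the
old === old
text === text
You can see that all the words in the input sentence have been reduced to their stems. Several of these stems, such as wa, busi, henc, and copi, are not valid English words. The output is also lowercased, which is why John becomes john.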
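To see how a stemmer can end up with non-words, it helps to look at a much simpler rule-based suffix stripper. The following is a toy sketch, not the Porter algorithm that NLTK implements: the function name toy_stem, the suffix list, and the minimum stem length are all assumptions made for illustration.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
# The suffix list and the minimum stem length are assumptions made for this sketch.
SUFFIXES = ("ing", "ed", "es", "e", "s")  # checked in this order


def toy_stem(word, min_stem_len=3):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough characters remain to form a stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


# All four forms collapse to the same non-word stem: delet
for w in ["delete", "deleted", "deleting", "deletes"]:
    print(w, "===", toy_stem(w))
```

Because the rules only look at the characters at the end of a word, nothing guarantees that what remains is a real word. A production stemmer such as PorterStemmer applies a far larger and more careful set of rules, but the same limitation applies.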
What is Lemmatization?
Lemmatization also reduces a word, but instead of reducing it to its stem, lemmatization reduces it to its dictionary root form (its lemma). Unlike stemming, where one word has only one stem, with lemmatization one word can have different dictionary root forms depending on its part of speech. For instance, if you treat the word deleted as a noun, the lemmatizer returns deleted unchanged. You can test this with the help of an example.
To implement lemmatization with NLTK, you need to create an object of the WordNetLemmatizer class and pass the word to the lemmatize() method of the object. Here is an example.
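import nltk
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet data; a one-time nltk.download("wordnet") may be required.
lemmatizer = WordNetLemmatizer()

word = "deleted"
lemma = lemmatizer.lemmatize(word)  # no POS given, so the word is treated as a noun
print(word, "===", lemma)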
Output:
deleted === deleted
By default, the lemmatize() method treats words as nouns. Since the lemmatizer finds no shorter noun form for deleted, it returns the word unchanged.
Let’s now reduce the word deleted to its dictionary root form by treating it as a verb. To do so, you have to pass a POS tag value as the second parameter of the lemmatize() method. To lemmatize the word deleted as a verb, pass "v" as the second parameter as shown below:
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

word = "deleted"
lemma = lemmatizer.lemmatize(word, "v")  # "v" = treat the word as a verb
print(word, "===", lemma)
Output:
deleted === delete
From the output, you can see that the word deleted when treated as a verb is lemmatized as delete.
In the same way, you can lemmatize adjectives as shown below:
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

word = "strongest"
lemma = lemmatizer.lemmatize(word, "a")  # "a" = treat the word as an adjective
print(word, "===", lemma)
Output:
strongest === strong
In the above script, the adjective strongest is reduced to its dictionary root form i.e. strong.
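The behaviour seen in these three examples (noun by default, a different result per part of speech, and the word returned unchanged when nothing is found) can be pictured as a dictionary lookup keyed on both the word and its part of speech. The sketch below is a toy model of that idea, not how WordNetLemmatizer is implemented: TOY_LEMMAS and toy_lemmatize are hypothetical names, and the table is hand-written to cover only this article's examples.

```python
# Toy dictionary-based lemmatizer for illustration; not WordNet or NLTK code.
# TOY_LEMMAS is a hypothetical, hand-written table covering only this article's examples.
TOY_LEMMAS = {
    ("deleted", "v"): "delete",
    ("deleting", "v"): "delete",
    ("strongest", "a"): "strong",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); return the word unchanged when there is no entry."""
    return TOY_LEMMAS.get((word, pos), word)


print(toy_lemmatize("deleted"))         # no noun entry, so unchanged: deleted
print(toy_lemmatize("deleted", "v"))    # delete
print(toy_lemmatize("strongest", "a"))  # strong
```

The key point is that the part of speech is part of the lookup key, so the same surface word can map to different results.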
Difference between Stemming and Lemmatization
From the above discussion, it can be concluded that there are two main differences between stemming and lemmatization:
- Stemming reduces a word to its stem even if the stem is not a meaningful word. For instance, the word deleted is reduced to delet by stemming, and delet does not exist in a dictionary. Lemmatization, on the other hand, reduces a word to its dictionary root form. For instance, the word deleted, when treated as a verb, is reduced to delete by lemmatization, and delete does exist in a dictionary.
- The second difference is that stemming does not take the part of speech of a word into account while reducing it to its stem. Lemmatization, on the other hand, is performed based on the part of speech of a word. For example, a word treated as a noun can be lemmatized differently from the same word treated as a verb.
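The second difference has a practical consequence: before lemmatizing a whole sentence, you need a part of speech for each word. Taggers such as nltk.pos_tag typically return Penn Treebank tags (for example VBD or JJS), whereas lemmatize() expects single letters such as "v" or "a". The following is a minimal sketch of one possible mapping. The helper name penn_to_wordnet is hypothetical, and falling back to noun is an assumption chosen to mirror the default of lemmatize().

```python
def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag (e.g. "VBD") to a WordNet POS letter.

    Hypothetical helper for illustration; unknown tags fall back to `default`.
    """
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, ...
    if tag.startswith("N"):
        return "n"  # nouns: NN, NNS, NNP, ...
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return default


for tag in ["VBD", "JJS", "NN", "DT"]:
    print(tag, "===", penn_to_wordnet(tag))
```

The returned letter could then be passed as the second argument to lemmatize(), for example lemmatizer.lemmatize(word, penn_to_wordnet(tag)).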