Stemming and lemmatization are algorithms used in natural language processing (NLP) to normalize text and prepare words and documents for further processing in machine learning. They are used, for example, by search engines or chatbots to find out the meaning of words.
In NLP, for example, one wants to recognize that the words “like” and “liked” are the same word in different tenses. The goal is then to reduce both words to a common root, which is done by either stemming or lemmatization. That way, both words are treated alike; otherwise “like” and “liked” would be as different for the model as “like” and “car”.
What is Stemming?
Stemming is the process of removing suffixes from words to obtain a so-called word stem or root. For example, the words “likes”, “likely” and “liked” are all reduced to the common root “like”, which then stands in for all three words. In this way, an NLP model can learn that all three words are somehow similar and are used in a similar context.
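To make the idea of suffix removal concrete, here is a minimal sketch of a toy stemmer. The function name `toy_stem`, the suffix list and the minimum stem length are assumptions chosen only to fit the “like” example; real stemmers such as Porter’s use far more elaborate rules.

```python
def toy_stem(word, suffixes=("ly", "d", "s"), min_stem_len=3):
    """Strip the first matching suffix, as long as enough of the word remains.

    Hypothetical illustration only: the suffix list is hand-picked for the
    "like" example and is not taken from any real stemming algorithm.
    """
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for word in ["like", "likes", "likely", "liked", "car"]:
    print(word, "->", toy_stem(word))
```

With these rules, “likes”, “likely” and “liked” all end up as “like”, while “car” stays unchanged. The same rules would mishandle many other words, which is exactly why practical stemmers need larger rule sets.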
Stemming allows us to standardize words to their base stem regardless of their inflections, which is useful in many applications such as clustering or text classification. Search engines use these techniques extensively to produce better results regardless of word form. Before Google introduced word stemming in 2003, a search for “fish” did not return websites about “fishes”.
Porter’s Stemmer Algorithm is one of the most popular Stemming methods and was proposed in 1980. It is based on the idea that the suffixes in the English language are made up of a combination of smaller and simpler suffixes. It is known for its efficient and simple processes, but it also has several disadvantages.
Since it is based on many hard-coded rules derived from the English language, it can only be used for English words. In addition, the output of Porter’s Stemmer is sometimes not an English word but only an artificial word stem:
    # Requires the nltk package
    from nltk.stem.porter import PorterStemmer

    porter_stemmer = PorterStemmer()
    print(porter_stemmer.stem('alumnus'))
    # Out: alumnu
However, the biggest problems are over-stemming and under-stemming, which are common shortcomings of most stemming algorithms.
What is Over- and Understemming?
Over-stemming occurs when the algorithm reduces several words to the same stem even though they should be kept apart. The words “universal”, “university” and “universe” share the same etymological root, but their meanings are far apart. If we typed these words into a good search engine, the results should be very different, so the words must not be treated as synonyms. A stemmer that maps all three to the same stem makes a false positive error.
Under-stemming is the exact opposite: words that should be reduced to a common stem are not. The word “alumnus” describes a former student of a university and is mostly used for men. “Alumna” is the female form, and “alumni” and “alumnae” are the corresponding plurals.
These words should definitely be treated as equivalent in a basic search engine or other NLP applications. However, most stemming algorithms do not reduce them to a common stem, which is a false negative error.
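Both error types can be counted once we state which words belong together. The following sketch assumes a hypothetical helper `stemming_errors` and a deliberately crude `toy_stem`. Neither comes from a library, and the suffix rules are invented for this example.

```python
from itertools import combinations


def toy_stem(word):
    """Crude, hypothetical suffix stripper used only to provoke both error types."""
    for suffix in ("ity", "al", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemming_errors(groups, stem):
    """Compare every pair of words against the groups they should belong to.

    over  = pairs from different groups that share a stem (false positives)
    under = pairs from the same group that get different stems (false negatives)
    """
    labelled = [(word, gid) for gid, words in enumerate(groups) for word in words]
    over, under = [], []
    for (w1, g1), (w2, g2) in combinations(labelled, 2):
        same_stem = stem(w1) == stem(w2)
        if same_stem and g1 != g2:
            over.append((w1, w2))
        elif not same_stem and g1 == g2:
            under.append((w1, w2))
    return over, under


# Each inner list is one concept whose words should share a stem.
groups = [["universal"], ["university"], ["universe"],
          ["alumnus", "alumnae", "alumni"]]

over, under = stemming_errors(groups, toy_stem)
print("over-stemmed pairs: ", over)
print("under-stemmed pairs:", under)
```

With this toy stemmer, the three “univers…” words collapse into one stem (over-stemming), while the three “alumn…” words each keep a different stem (under-stemming).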
What is Lemmatization?
Lemmatization is a further development of stemming and describes the process of grouping together the different inflected forms of a word so they can be analyzed as a single item, the lemma. Lemmatizers are similar to stemmers, but they take the context of a word into account and thereby map words with the same meaning to one dictionary form. For this reason, lemmatizer algorithms usually also take the part of speech as an input, that is, whether the word is an adjective, noun, or verb.
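The role of the part of speech can be sketched with a tiny dictionary lookup. The lexicon, the tag names (`"adj"`, `"noun"`, `"verb"`) and the function `toy_lemmatize` are assumptions made up for this illustration. Real lemmatizers rely on much larger dictionaries and morphological analysis.

```python
# Hand-made mini lexicon: (word, part of speech) -> lemma.
# Hypothetical data for illustration, not taken from any real resource.
LEXICON = {
    ("better", "adj"): "good",
    ("liked", "verb"): "like",
    ("likes", "verb"): "like",
    ("meeting", "verb"): "meet",
    ("meeting", "noun"): "meeting",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos); unknown words are returned unchanged."""
    word = word.lower()
    return LEXICON.get((word, pos), word)


print(toy_lemmatize("better", "adj"))    # good
print(toy_lemmatize("meeting", "verb"))  # meet
print(toy_lemmatize("meeting", "noun"))  # meeting
```

The word “meeting” shows why the extra input matters: as a verb form it belongs to “meet”, as a noun it is already its own lemma. A pure suffix rule cannot tell these two cases apart.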
Both stemming and lemmatization have their place in text preprocessing for NLP. Sometimes you will even find articles or discussions where the two terms are used as synonyms, even though they are not. Usually, lemmatizers are preferred over stemmers because they analyze words in context instead of using hard-coded rules to chop off suffixes. However, this analysis is more expensive: if the text documents are very long, a lemmatizer takes considerably more time, which can be a severe disadvantage.
What is the difference between Lemmatization and Stemming?
In short, the difference between these algorithms is that only a lemmatizer includes the meaning of the word in the evaluation. In stemming, only a certain number of letters are cut off from the end of the word to obtain a word stem. The meaning of the word does not play a role in it.
For example, a lemmatizer recognizes that the English word “better” is the comparative of “good” and maps it to the lemma “good”. A stemmer cannot make this connection: because it only looks at the letters of the word, it would either leave “better” unchanged or truncate it to something like “bet” or “bett”.
Is Lemmatization better than Stemming?
In text preprocessing for NLP, both stemming and lemmatization have their raison d’être. Sometimes you can even find articles or discussions where both words are used as synonyms, although they are not.
Typically, lemmatizers are preferred to stemmers because they analyze words in context rather than using hard-coded rules to truncate suffixes. This contextuality is especially important when content needs to be specifically understood, as is the case in a chatbot, for example.
For other applications, stemming may be sufficient. Search engines, for example, use it on a large scale to improve search results. Because the index is searched not only for the exact search phrase but also for its word stems, documents are found regardless of word form, and the search can also be accelerated considerably.
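How stems in an index help a search can be sketched with a small inverted index. The documents, the function names `build_index` and `search`, and the crude `toy_stem` rules are all assumptions for this example. A production search engine would use a proper stemmer and a far more elaborate index.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common endings."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query word and look it up directly in the index."""
    return sorted(index.get(toy_stem(query.lower()), set()))


docs = {
    1: "tropical fishes for beginners",
    2: "fishing gear on sale",
    3: "used car prices",
}
index = build_index(docs)

print(search(index, "fish"))  # [1, 2]
print(search(index, "cars"))  # [3]
```

Because both the documents and the query are reduced to stems, a query for “fish” finds the documents containing “fishes” and “fishing” with a single dictionary lookup, without scanning the texts for every possible word form.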
In which areas are these Algorithms used?
As mentioned earlier, these two methods are particularly interesting in the area of Natural Language Processing. The following applications make use of them:
- Search algorithms: The quality of search results can be significantly improved if, for example, word stems are used and thus misspellings or plural forms are not as significant.
- Knowledge graphs: When building knowledge structures, such as a Knowledge Graph, these algorithms help extract entities, such as people or places, and connect them to other entities. These knowledge graphs can also, in turn, improve search algorithms.
- Sentiment analysis: In many applications, it makes sense to classify texts according to sentiment, for example, positive or negative. This allows product reviews, for example, to be classified very quickly and processed in a more targeted manner. Using the algorithms presented can help the classification model to make better predictions.
This is what you should take with you
- Stemming and Lemmatization are methods that help us in text preprocessing for Natural Language Processing.
- Both of them help to map multiple words to a common root word.
- That way, these words are treated similarly and the model learns that they can be used in similar contexts.
Other Articles on the Topic of Stemming vs Lemmatization
- On this website, you can find an online tool that lets you test different stemming algorithms by processing a word directly online.