This article describes some pre-processing steps that are commonly used in Information Retrieval (IR), Natural Language Processing (NLP) and text analytics applications.
In particular, the focus is on the comparison between stemming and lemmatisation, and the need for part-of-speech (POS) tagging in this context. The discussion includes some examples in NLTK, also available as a Gist on GitHub.
Stemming is the process of reducing a word to its stem, i.e. its root form. The root form is not necessarily a word by itself, but it can be used to generate words by concatenating the right suffix.
For example, the words fish, fishes and fishing all stem to fish, which is a correct word. On the other hand, the words study, studies and studying stem to studi, which is not an English word.
Most commonly, stemming algorithms (a.k.a. stemmers) are based on rules for suffix stripping.
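To make the suffix-stripping idea concrete, here is a minimal toy stemmer. The rules below are invented for this example and are far simpler than any real stemmer such as Porter's; they only illustrate the mechanism of rule-based suffix stripping.

```python
# Toy suffix-stripping rules: (suffix, replacement) pairs, tried in order.
# These are illustrative only, not the actual Porter stemmer rules.
SUFFIX_RULES = [
    ("ies", "i"),   # studies -> studi
    ("ing", ""),    # fishing -> fish
    ("es", ""),     # fishes -> fish
    ("s", ""),      # cats -> cat
]

def toy_stem(word):
    """Strip the first matching suffix, keeping a minimum stem length."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + replacement
    return word

print(toy_stem("fishing"))  # fish
print(toy_stem("studies"))  # studi
```

Even this crude version shows why a stem need not be a real word: studies becomes studi, exactly as with the full Porter algorithm.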
The most famous example is the Porter stemmer, introduced in the 1980s and currently implemented in a variety of programming languages.
Traditionally, search engines and other IR applications have applied stemming to improve the chance of matching different forms of a word, almost treating them like synonyms, as conceptually they “belong” together.
The purpose of lemmatisation is to group together the different inflected forms of a word so they can be analysed as a single item, called the lemma. The process is somewhat similar to stemming, as it maps several words to one common root, but the output of lemmatisation is a proper word, and basic suffix stripping wouldn't provide the same outcome. For example, a lemmatiser should map gone, going and went to go. In order to achieve its purpose, lemmatisation requires knowledge about the context of a word, because the process depends on whether the word is a noun, a verb, etc.
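A minimal sketch of why context matters: the same surface form can map to different lemmas depending on its part of speech. The lookup table below is a tiny hand-built example, not a real lexicon, and toy_lemmatise is a hypothetical helper for illustration only.

```python
# Hand-built (word, pos) -> lemma table; a real lemmatiser would use a
# full lexicon such as WordNet instead.
LEMMA_TABLE = {
    ("went", "v"): "go",
    ("gone", "v"): "go",
    ("going", "v"): "go",
    ("meeting", "v"): "meet",     # "we are meeting tomorrow"
    ("meeting", "n"): "meeting",  # "the meeting was long"
}

def toy_lemmatise(word, pos):
    """Look up the lemma for a word given its part of speech."""
    return LEMMA_TABLE.get((word, pos), word)

print(toy_lemmatise("meeting", "v"))  # meet
print(toy_lemmatise("meeting", "n"))  # meeting
```

Without the pos argument there is no way to decide whether meeting should become meet: the word alone is ambiguous.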
Part-of-speech (POS) tagging is the process of assigning a word to its grammatical category, in order to understand its role within the sentence. Traditional parts of speech are nouns, verbs, adverbs, conjunctions, etc.
Part-of-speech taggers typically take a sequence of words (i.e. a sentence) as input, and provide a list of tuples as output, where each word is associated with the related tag.
Part-of-speech tagging is what provides the contextual information that a lemmatiser needs to choose the appropriate lemma.
Examples in Python and NLTK
One of the most popular packages for NLP in Python is the Natural Language Toolkit (NLTK). It includes several tools for text analytics, as well as training data for some of the tools, and also some well-known data sets.
To install NLTK:
$ sudo pip install nltk
In order to install the additional data, you can use its internal tool. From a Python interactive shell, simply type:
>>> import nltk
>>> nltk.download()
This will open a GUI which you can use to choose the data you want to download (if you're not using a GUI environment, the interface will be textual). In a dev environment, I normally just download all the data for all the packages in the default folder ($HOME/nltk_data), but you can personalise the location and the list of packages to download.
A full example of stemming, lemmatisation and POS-tagging is available as Gist on github.
Let’s focus on this snippet:
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

print("Stem %s: %s" % ("studying", stemmer.stem("studying")))
print("Lemmatise %s: %s" % ("studying", lemmatiser.lemmatize("studying")))
print("Lemmatise %s: %s" % ("studying", lemmatiser.lemmatize("studying", pos="v")))
The output will be:
Stem studying: studi
Lemmatise studying: studying
Lemmatise studying: study
We can observe that the stemming process doesn’t generate a real word, but a root form.
On the other hand, the lemmatiser generates real words, but without contextual information it's not able to distinguish between nouns and verbs, hence the lemmatisation process doesn't change the word. The context is provided by the POS tag ("v" for verb in this example).
In order to generate POS tags automatically, NLTK comes with a simple function. The snippet for POS tagging:
from nltk import pos_tag
from nltk.tokenize import word_tokenize

s = "This is a simple sentence"
tokens = word_tokenize(s)  # Generate list of tokens
tokens_pos = pos_tag(tokens)

print(tokens_pos)
and the output will be:
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('sentence', 'NN')]
NLTK uses the set of tags from the Penn Treebank project.
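To tie the two steps together, the Penn Treebank tags returned by pos_tag can be converted to the single-letter pos values that WordNetLemmatizer accepts ('n', 'v', 'a', 'r'). A simple prefix-based mapping, defaulting to noun, is a common convention; the helper name penn_to_wordnet below is my own.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun and everything else

print(penn_to_wordnet("VBZ"))  # v
print(penn_to_wordnet("NN"))   # n
```

With this mapping in place, the output of pos_tag can be fed tag by tag into lemmatiser.lemmatize(word, pos=penn_to_wordnet(tag)).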
Stemming, lemmatisation and POS-tagging are important pre-processing steps in many text analytics applications. You can get up and running very quickly and include these capabilities in your Python applications by using the off-the-shelf solutions offered by NLTK.