Stemming, Lemmatisation and POS-tagging with Python and NLTK

This article describes some pre-processing steps that are commonly used in Information Retrieval (IR), Natural Language Processing (NLP) and text analytics applications.

In particular, the focus is on the comparison between stemming and lemmatisation, and the need for part-of-speech tagging in this context. The discussion shows some examples in NLTK, which are also available as a Gist on GitHub.

Stemming

Stemming is the process of reducing a word to its stem, i.e. its root form. The root form is not necessarily a word by itself, but it can be used to generate words by attaching the right suffix.

For example, the words fish, fishes and fishing all stem to fish, which is itself a valid English word. On the other hand, the words study, studies and studying all stem to studi, which is not an English word.

Most commonly, stemming algorithms (a.k.a. stemmers) are based on rules for suffix stripping.
The most famous example is the Porter stemmer, introduced in 1980 and currently implemented in a variety of programming languages.
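
For instance, using NLTK's implementation of the Porter stemmer (presented in more detail later in the article):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["fish", "fishes", "fishing", "study", "studies", "studying"]:
    # Each word is reduced to its root form, which may not be a real word
    print(word, "->", stemmer.stem(word))

which prints fish for the first three words and studi for the last three.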

Traditionally, search engines and other IR applications have applied stemming to improve the chance of matching different forms of a word, almost treating them like synonyms, as conceptually they “belong” together.

Lemmatisation

The purpose of lemmatisation is to group together the different inflected forms of a word, so they can be analysed as a single item identified by the word's lemma, i.e. its dictionary form. The process is somewhat similar to stemming, as it maps several words to one common root, but the output of lemmatisation is a proper word, and basic suffix stripping wouldn't provide the same outcome. For example, a lemmatiser should map gone, going and went to go. In order to achieve its purpose, lemmatisation requires knowledge of the context of a word, because the process depends on whether the word is used as a noun, a verb, etc.
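
For instance, a quick check with NLTK's WordNetLemmatizer (presented in more detail later in the article), where pos="v" provides the context that the words are verbs:

from nltk.stem import WordNetLemmatizer

lemmatiser = WordNetLemmatizer()
for word in ["gone", "going", "went"]:
    # The "v" (verb) context lets WordNet resolve irregular forms too
    print(word, "->", lemmatiser.lemmatize(word, pos="v"))

which prints go for all three words.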

Part-of-speech Tagging

Part-of-speech (POS) tagging is the process of assigning a word to its grammatical category, in order to understand its role within the sentence. Traditional parts of speech are nouns, verbs, adverbs, conjunctions, etc.

Part-of-speech taggers typically take a sequence of words (i.e. a sentence) as input, and provide a list of tuples as output, where each word is associated with the related tag.

Part-of-speech tagging is what provides the contextual information that a lemmatiser needs to choose the appropriate lemma.

Examples in Python and NLTK

One of the most popular packages for NLP in Python is the Natural Language Toolkit (NLTK). It includes several tools for text analytics, as well as training data for some of the tools, and also some well-known data sets.

To install NLTK:

pip install nltk

In order to install the additional data, you can use NLTK's built-in download tool. From a Python interactive shell, simply type:

import nltk
nltk.download()

This will open a GUI which you can use to choose which data you want to download (if you're not using a GUI environment, the interface will be textual). In a dev environment, I normally just download all the data for all the packages in the default folder ($HOME/nltk_data), but you can personalise your installation.
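
If you prefer a scripted setup, you can also download individual data sets by name. A minimal sketch covering the data used in the examples below (the resource names reflect recent NLTK versions and may differ in older releases):

import nltk

nltk.download("punkt")                       # tokeniser models used by word_tokenize
nltk.download("wordnet")                     # WordNet data used by the lemmatiser
nltk.download("averaged_perceptron_tagger")  # model used by pos_tag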

A full example of stemming, lemmatisation and POS-tagging is available as a Gist on GitHub.

Let’s focus on this snippet:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

print("Stem %s: %s" % ("studying", stemmer.stem("studying")))
print("Lemmatise %s: %s" % ("studying", lemmatiser.lemmatize("studying")))
print("Lemmatise %s: %s" % ("studying", lemmatiser.lemmatize("studying", pos="v")))

The output will be:

Stem studying: studi
Lemmatise studying: studying
Lemmatise studying: study

We can observe that the stemming process doesn't generate a real word, but a root form. On the other hand, the lemmatiser does generate real words, but without contextual information it treats every word as a noun by default, hence the lemmatisation process doesn't change the word. The context is provided by the POS tag ("v" for verb in this example).

In order to generate POS tags automatically, NLTK comes with a simple function, pos_tag. The snippet for POS tagging:

from nltk import pos_tag
from nltk.tokenize import word_tokenize

s = "This is a simple sentence"
tokens = word_tokenize(s) # Generate list of tokens
tokens_pos = pos_tag(tokens) # Generate list of (token, tag) tuples

print(tokens_pos)

and the output will be:

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('sentence', 'NN')]

NLTK uses the set of tags from the Penn Treebank project.
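
If you want to check what a tag means without leaving the interpreter, NLTK also ships the tag set documentation (assuming the tagsets data set has been downloaded):

import nltk

nltk.help.upenn_tagset("VBZ")  # prints the definition and examples for the VBZ tag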

Summary

Stemming, lemmatisation and POS-tagging are important pre-processing steps in many text analytics applications. You can get up and running very quickly and include these capabilities in your Python applications by using the off-the-shelf solutions offered by NLTK.

2 thoughts on “Stemming, Lemmatisation and POS-tagging with Python and NLTK”

  1. Passing the tokens after POS-tagging them does not allow them to be run through the WordNetLemmatizer. How does one go about using the correct POS tagger for this? I have tried the Stanford tagger with no success either…

  2. The POS tagger in NLTK is trained on the Treebank corpus, meaning it also uses the Treebank POS tags, i.e. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

    On the other hand, WordNet uses fewer classes, as you can see from here (see the declaration of “part-of-speech constants”): http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html

    One solution is to write a small helper that just matches the starting letter of the Treebank POS, returning the relevant WordNet POS, e.g. http://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python
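
    A minimal sketch of such a helper, following the approach in the linked answer:

    from nltk import pos_tag
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    def get_wordnet_pos(treebank_tag):
        # Map the first letter of a Penn Treebank tag to a WordNet POS constant
        if treebank_tag.startswith("J"):
            return wordnet.ADJ
        elif treebank_tag.startswith("V"):
            return wordnet.VERB
        elif treebank_tag.startswith("R"):
            return wordnet.ADV
        else:
            return wordnet.NOUN  # fall back to noun, the lemmatiser's default

    lemmatiser = WordNetLemmatizer()
    tokens_pos = pos_tag(word_tokenize("The cats were chasing mice"))
    print([lemmatiser.lemmatize(token, pos=get_wordnet_pos(tag))
           for token, tag in tokens_pos])
    # ['The', 'cat', 'be', 'chase', 'mouse']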
