Marco Bonzanini

Mining Twitter Data with Python: Part 5 – Data Visualisation Basics

A picture is worth a thousand tweets: more often than not, designing a good visual representation of our data, can help us make sense of them and highlight interesting insights. After collecting and analysing Twitter data, the tutorial continues with some notions on data visualisation with Python.

Tutorial Table of Contents:

Part 1: Collecting data
Part 2: Text Pre-processing
Part 3: Term Frequencies
Part 4: Rugby and Term Co-Occurrences
Part 5: Data Visualisation Basics (this article)
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps

From Python to Javascript with Vincent

While there are some options to create plots in Python using libraries like matplotlib or ggplot, one of the coolest libraries for data visualisation is probably D3.js which is, as the name suggests, based on Javascript. D3 plays well with web standards like CSS and SVG, and allows to create some wonderful interactive visualisations.

Vincent bridges the gap between a Python back-end and a front-end that supports D3.js visualisation, allowing us to benefit from both sides. The tagline of Vincent is in fact “The data capabilities of Python. The visualization capabilities of JavaScript”. Vincent, a Python library, takes our data in Python format and translates them into Vega, a JSON-based visualisation grammar that will be used on top of D3. It sounds quite complicated, but it’s fairly simple and pythonic. You don’t have to write a line in Javascript/D3 if you don’t want to.

Firstly, let’s install Vincent:

pip install vincent

Secondly, let’s create our first plot. Using the list of most frequent terms (without hashtags) from our rugby data set, we want to plot their frequencies:

import vincent

word_freq = count_terms_only.most_common(20)
labels, freq = zip(*word_freq)
data = {'data': freq, 'x': labels}
bar = vincent.Bar(data, iter_idx='x')
bar.to_json('term_freq.json')

At this point, the file term_freq.json will contain a description of the plot that can be handed over to D3.js and Vega. A simple template (taken from Vincent resources) to visualise the plot:

&lt;html&gt;
&lt;head&gt;
    &lt;title&gt;Vega Scaffold&lt;/title&gt;
    &lt;script src="http://d3js.org/d3.v3.min.js" charset="utf-8"&gt;&lt;/script&gt;
    &lt;script src="http://d3js.org/topojson.v1.min.js"&gt;&lt;/script&gt;
    &lt;script src="http://d3js.org/d3.geo.projection.v0.min.js" charset="utf-8"&gt;&lt;/script&gt;
    &lt;script src="http://trifacta.github.com/vega/vega.js"&gt;&lt;/script&gt;
&lt;/head&gt;
&lt;body&gt;
    &lt;div id="vis"&gt;&lt;/div&gt;
&lt;/body&gt;
&lt;script type="text/javascript"&gt;
// parse a spec and create a visualization view
function parse(spec) {
  vg.parse.spec(spec, function(chart) { chart({el:"#vis"}).update(); });
}
parse("term_freq.json");
&lt;/script&gt;
&lt;/html&gt;

Save the above HTML page as chart.html and run the simple Python web server:

python -m http.server 8888 # Python 3
python -m SimpleHTTPServer 8888 # Python 2

Now you can open your browser at http://localhost:8888/chart.html and observe the result:

Notice: you could save the HTML template directly from Python with:

bar.to_json('term_freq.json', html_out=True, html_path='chart.html')

but, at least in Python 3, the output is not a well formed HTML and you’d need to manually strip some characters.

With this procedure, we can plot many different types of charts with Vincent. Let’s take a moment to browse the docs and see its capabilities.

Time Series Visualisation

Another interesting aspect of analysing data from Twitter is the possibility to observe the distribution of tweets over time. In other words, if we organise the frequencies into temporal buckets, we could observe how Twitter users react to real-time events.

One of my favourite tools for data analysis with Python is Pandas, which also has a fairly decent support for time series. As an example, let’s track the hashtag #ITAvWAL to observe what happened during the first match.

Firstly, if we haven’t done it yet, we need to install Pandas:

pip install pandas

In the main loop which reads all the tweets, we simply track the occurrences of the hashtag, i.e. we can refactor the code from the previous episodes into something similar to:

import pandas
import json

dates_ITAvWAL = []
# f is the file pointer to the JSON data set
for line in f:
    tweet = json.loads(line)
    # let's focus on hashtags only at the moment
    terms_hash = [term for term in preprocess(tweet['text']) if term.startswith('#')]
    # track when the hashtag is mentioned
    if '#itavwal' in terms_hash:
        dates_ITAvWAL.append(tweet['created_at'])

# a list of "1" to count the hashtags
ones = [1]*len(dates_ITAvWAL)
# the index of the series
idx = pandas.DatetimeIndex(dates_ITAvWAL)
# the actual series (at series of 1s for the moment)
ITAvWAL = pandas.Series(ones, index=idx)

# Resampling / bucketing
per_minute = ITAvWAL.resample('1Min', how='sum').fillna(0)

The last line is what allows us to track the frequencies over time. The series is re-sampled with intervals of 1 minute. This means all the tweets falling within a particular minute will be aggregated, more precisely they will be summed up, given how='sum'. The time index will not keep track of the seconds anymore. If there is no tweet in a particular minute, the fillna() function will fill the blanks with zeros.

To put the time series in a plot with Vincent:

time_chart = vincent.Line(ITAvWAL)
time_chart.axis_titles(x='Time', y='Freq')
time_chart.to_json('time_chart.json')

Once you embed the time_chart.json file into the HTML template discussed above, you’ll see this output:

The interesting moments of the match are observable from the spikes in the series. The first spike just before 1pm corresponds to the first Italian try. All the other spikes between 1:30 and 2:30pm correspond to Welsh tries and show the Welsh dominance during the second half. The match was over by 2:30, so after that Twitter went quiet.

Rather than just observing one sequence at a time, we could compare different series to observe how the matches has evolved. So let’s refactor the code for the time series, keeping track of the three different hashtags #ITAvWAL, #SCOvIRE and #ENGvFRA into the corresponding pandas.Series.

# all the data together
match_data = dict(ITAvWAL=per_minute_i, SCOvIRE=per_minute_s, ENGvFRA=per_minute_e)
# we need a DataFrame, to accommodate multiple series
all_matches = pandas.DataFrame(data=match_data,
                               index=per_minute_i.index)
# Resampling as above
all_matches = all_matches.resample('1Min', how='sum').fillna(0)

# and now the plotting
time_chart = vincent.Line(all_matches[['ITAvWAL', 'SCOvIRE', 'ENGvFRA']])
time_chart.axis_titles(x='Time', y='Freq')
time_chart.legend(title='Matches')
time_chart.to_json('time_chart.json')

And the output:

We can immediately observe when the different matches took place (approx 12:30-2:30, 2:30-4:30 and 5-7) and we can see how the last match had the all the attentions, especially in the end when the winner was revealed.

Summary

Data visualisation is an important discipline in the bigger context of data analysis. By supporting visual representations of our data, we can provide interesting insights. We have discussed a relatively simple option to support data visualisation with Python using Vincent. In particular, we have seen how we can easily bridge the gap between Python and a language like Javascript that offers a great tool like D3.js, one of the most important libraries for interactive visualisation. Overall, we have just scratched the surface of data visualisation, but as a starting point this should be enough to get some nice ideas going. The nature of Twitter as a medium has also encouraged a quick look into the topic of time series analysis, allowing us to mention pandas as a great Python tool.

If this article has given you some ideas for data visualisation, please leave a comment below or get in touch.

@MarcoBonzanini

Tutorial Table of Contents:

Part 1: Collecting data
Part 2: Text Pre-processing
Part 3: Term Frequencies
Part 4: Rugby and Term Co-Occurrences
Part 5: Data Visualisation Basics (this article)
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps

Mining Twitter Data with Python (Part 4: Rugby and Term Co-occurrences)

Last Saturday was the closing day of the Six Nations Championship, an annual international rugby competition. Before turning on the TV to watch Italy being trashed by Wales, I decided to use this event to collect some data from Twitter and perform some exploratory text analysis on something more interesting than the small list of my tweets.

This article continues the tutorial on Twitter Data Mining, re-using what we discussed in the previous articles with some more realistic data. It also expands the analysis by introducing the concept of term co-occurrence.

Tutorial Table of Contents:

Part 1: Collecting data
Part 2: Text Pre-processing
Part 3: Term Frequencies
Part 4: Rugby and Term Co-Occurrences (this article)
Part 5: Data Visualisation Basics
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps

The Application Domain

As the name suggests, six teams are involved in the competition: England, Ireland, Wales, Scotland, France and Italy. This means that we can expect the event to be tweeted in multiple languages (English, French, Italian, Welsh, Gaelic, possibly other languages as well), with English being the major language. Assuming the team names will be mentioned frequently, we could decide to look also for their nicknames, e.g. Les Bleus for France or Azzurri for Italy. During the last day of the competition, three matches are played sequentially. Three teams in particular had a shot for the title: England, Ireland and Wales. At the end, Ireland won the competition but everything was open until the very last minute.

Setting Up

I used the streaming API to download all the tweets containing the string #rbs6nations during the day. Obviously not all the tweets about the event contained the hashtag, but this is a good baseline. The time frame for the download was from around 12:15PM to 7:15PM GMT, that is from about 15 minutes before the first match, to about 15 minutes after the last match was over. At the end, more than 18,000 tweets have been downloaded in JSON format, making for about 75Mb of data. This should be small enough to quickly do some processing in memory, and at the same time big enough to observe something possibly interesting.

The textual content of the tweets has been pre-processed with tokenisation and lowercasing using the preprocess() function introduced in Part 2 of the tutorial.

Interesting terms and hashtags

Following what we discussed in Part 3 (Term Frequencies), we want to observe the most common terms and hashtags used during day. If you have followed the discussion about creating different lists of tokens in order to capture terms without hashtags, hashtags only, removing stop-words, etc. you can play around with the different lists.

This is the unsurprising list of top 10 most frequent terms (terms_only in Part 3) in the data set.

[('ireland', 3163), ('england', 2584), ('wales', 2271), ('…', 2068), ('day', 1479), ('france', 1380), ('win', 1338), ('rugby', 1253), ('points', 1221), ('title', 1180)]

The first three terms correspond to the teams who had a go for the title. The frequencies also respect the order in the final table. The fourth term is instead a punctuation mark that we missed and didn’t include in the list of stop-words. This is because string.punctuation only contains ASCII symbols, while here we’re dealing with a unicode character. If we dig into the data, there will be more examples like this, but for the moment we don’t worry about it.

After adding the suspension-points symbol to the list of stop-words, we have a new entry at the end of the list:

[('ireland', 3163), ('england', 2584), ('wales', 2271), ('day', 1479), ('france', 1380), ('win', 1338), ('rugby', 1253), ('points', 1221), ('title', 1180), ('🍀', 1154)]

Interestingly, a new token we didn’t account for, an Emoji symbol (in this case, the Irish Shamrock).

If we have a look at the most common hashtags, we need to consider that #rbs6nations will be by far the most common token (that’s our search term for downloading the tweets), so we can exclude it from the list. This leave us with:

[('#engvfra', 1701), ('#itavwal', 927), ('#rugby', 880), ('#scovire', 692), ('#ireland', 686), ('#angfra', 554), ('#xvdefrance', 508), ('#crunch', 500), ('#wales', 446), ('#england', 406)]

We can observe that the most common hashtags, a part from #rugby, are related to the individual matches. In particular England v France has received the highest number of mentions, probably being the last match of the day with a dramatic finale. Something interesting to notice is that a fair amount of tweets also contained terms in French: the count for #angfra should in fact be added to #engvfra. Those unfamiliar with rugby probably wouldn’t recognise that also #crunch should be included with #EngvFra match, as Le Crunch is the traditional name for this event. So by far, the last match has received a lot of attention.

Term co-occurrences

Sometimes we are interested in the terms that occur together. This is mainly because the context gives us a better insight about the meaning of a term, supporting applications such as word disambiguation or semantic similarity. We discussed the option of using bigrams in the previous article, but we want to extend the context of a term to the whole tweet.

We can refactor the code from the previous article in order to capture the co-occurrences. We build a co-occurrence matrix com such that com[x][y] contains the number of times the term x has been seen in the same tweet as the term y:

from collections import defaultdict
# remember to include the other import from the previous post

com = defaultdict(lambda : defaultdict(int))

# f is the file pointer to the JSON data set
for line in f: 
    tweet = json.loads(line)
    terms_only = [term for term in preprocess(tweet['text']) 
                  if term not in stop 
                  and not term.startswith(('#', '@'))]

    # Build co-occurrence matrix
    for i in range(len(terms_only)-1):            
        for j in range(i+1, len(terms_only)):
            w1, w2 = sorted([terms_only[i], terms_only[j]])                
            if w1 != w2:
                com[w1][w2] += 1

While building the co-occurrence matrix, we don’t want to count the same term pair twice, e.g. com[A][B] == com[B][A], so the inner for loop starts from i+1 in order to build a triangular matrix, while sorted will preserve the alphabetical order of the terms.

For each term, we then extract the 5 most frequent co-occurrent terms, creating a list of tuples in the form ((term1, term2), count):

com_max = []
# For each term, look for the most common co-occurrent terms
for t1 in com:
    t1_max_terms = sorted(com[t1].items(), key=operator.itemgetter(1), reverse=True)[:5]
    for t2, t2_count in t1_max_terms:
        com_max.append(((t1, t2), t2_count))
# Get the most frequent co-occurrences
terms_max = sorted(com_max, key=operator.itemgetter(1), reverse=True)
print(terms_max[:5])

The results:

[(('6', 'nations'), 845), (('champions', 'ireland'), 760), (('nations', 'rbs'), 742), (('day', 'ireland'), 731), (('ireland', 'wales'), 674)]

This implementation is pretty straightforward, but depending on the data set and on the use of the matrix, one might want to look into tools like scipy.sparse for building a sparse matrix.

We could also look for a specific term and extract its most frequent co-occurrences. We simply need to modify the main loop including an extra counter, for example:

search_word = sys.argv[1] # pass a term as a command-line argument
count_search = Counter()
for line in f:
    tweet = json.loads(line)
    terms_only = [term for term in preprocess(tweet['text']) 
                  if term not in stop 
                  and not term.startswith(('#', '@'))]
    if search_word in terms_only:
        count_search.update(terms_only)
print("Co-occurrence for %s:" % search_word)
print(count_search.most_common(20))

The outcome for “ireland”:

[('champions', 756), ('day', 727), ('nations', 659), ('wales', 654), ('2015', 638), ('6', 613), ('rbs', 585), ('http://t.co/y0nvsvayln', 559), ('🍀', 526), ('10', 522), ('win', 377), ('england', 377), ('twickenham', 361), ('40', 360), ('points', 356), ('sco', 355), ('ire', 355), ('title', 346), ('scotland', 301), ('turn', 295)]

The outcome for “rugby”:

[('day', 476), ('game', 160), ('ireland', 143), ('england', 132), ('great', 105), ('today', 104), ('best', 97), ('well', 90), ('ever', 89), ('incredible', 87), ('amazing', 84), ('done', 82), ('amp', 71), ('games', 66), ('points', 64), ('monumental', 58), ('strap', 56), ('world', 55), ('team', 55), ('http://t.co/bhmeorr19i', 53)]

Overall, quite interesting.

Summary

This article has discussed a toy example of Text Mining on Twitter, using some realistic data taken during a sport event. Using what we have learnt in the previous episodes, we have downloaded some data using the streaming API, pre-processed the data in JSON format and extracted some interesting terms and hashtags from the tweets. The article has also introduced the concept of term co-occurrence, shown how to build a co-occurrence matrix and discussed how to use it to find some interesting insight.

@MarcoBonzanini

Part 1: Collecting data
Part 2: Text Pre-processing
Part 3: Term Frequencies
Part 4: Rugby and Term Co-Occurrences (this article)
Part 5: Data Visualisation Basics
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps

Mining Twitter Data with Python (Part 3: Term Frequencies)

This is the third part in a series of articles about data mining on Twitter. After collecting data and pre-processing some text, we are ready for some basic analysis. In this article, we’ll discuss the analysis of term frequencies to extract meaningful terms from our tweets.

Tutorial Table of Contents:

Part 1: Collecting data
Part 2: Text Pre-processing
Part 3: Term Frequencies (this article)
Part 4: Rugby and Term Co-Occurrences
Part 5: Data Visualisation Basics
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps

Counting Terms

Assuming we have collected a list of tweets (see Part 1 of the tutorial), the first exploratory analysis that we can perform is a simple word count. In this way, we can observe what are the terms most commonly used in the data set. In this example, I’ll use the set of my tweets, so the most frequent words should correspond to the topics I discuss (not necessarily, but bear with be for a couple of paragraphs).

We can use a custom tokeniser to split the tweets into a list of terms. The following code uses the preprocess() function described in Part 2 of the tutorial, in order to capture Twitter-specific aspects of the text, such as #hashtags, @-mentions, emoticons and URLs. In order to keep track of the frequencies while we are processing the tweets, we can use collections.Counter() which internally is a dictionary (term: count) with some useful methods like most_common():

import operator 
import json
from collections import Counter

fname = 'mytweets.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))

The above code will produce some unimpressive results:

[(':', 44), ('rt', 26), ('to', 26), ('and', 25), ('on', 22)]

As you can see, the most frequent words (or should I say, tokens), are not exactly meaningful.

Removing stop-words

In every language, some words are particularly common. While their use in the language is crucial, they don’t usually convey a particular meaning, especially if taken out of context. This is the case of articles, conjunctions, some adverbs, etc. which are commonly called stop-words. In the example above, we can see three common stop-words – to, and and on. Stop-word removal is one important step that should be considered during the pre-processing stages. One can build a custom list of stop-words, or use available lists (e.g. NLTK provides a simple list for English stop-words).

Given the nature of our data and our tokenisation, we should also be careful with all the punctuation marks and with terms like RT (used for re-tweets) and via (used to mention the original author of an article or a re-tweet), which are not in the default stop-word list.

from nltk.corpus import stopwords
import string

punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']

We can now substitute the variable terms_all in the first example with something like:

terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]

After counting, sorting the terms and printing the top 5, this is the result:

[('python', 11), ('@miguelmalvarez', 9), ('#python', 9), ('data', 8), ('@danielasfregola', 7)]

So apparently I mostly tweet about Python and data, and the users I re-tweet more often are @miguelmalvarez and @danielasfregola, it sounds about right.

More term filters

Besides stop-word removal, we can further customise the list of terms/tokens we are interested in. Here you have some examples that you can embed in the first fragment of code:

# Count terms only once, equivalent to Document Frequency
terms_single = set(terms_all)
# Count hashtags only
terms_hash = [term for term in preprocess(tweet['text']) 
              if term.startswith('#')]
# Count terms only (no hashtags, no mentions)
terms_only = [term for term in preprocess(tweet['text']) 
              if term not in stop and 
              not term.startswith(('#', '@'))] 
              # mind the ((double brackets))
              # startswith() takes a tuple (not a list) if 
              # we pass a list of inputs

After counting and sorting, these are my most commonly used hashtags:

[('#python', 9), ('#scala', 6), ('#nosql', 4), ('#bigdata', 3), ('#nlp', 3)]

and these are my most commonly used terms:

[('python', 11), ('data', 8), ('summarisation', 6), ('twitter', 5), ('nice', 5)]

“nice”?

While the other frequent terms represent a clear topic, more often than not simple term frequencies don’t give us a deep explanation of what the text is about. To put things in context, let’s consider sequences of two terms (a.k.a. bigrams).

from nltk import bigrams 

terms_bigram = bigrams(terms_stop)

The bigrams() function from NLTK will take a list of tokens and produce a list of tuples using adjacent tokens. Notice that we could use terms_all to compute the bigrams, but we would probably end up with a lot of garbage. In case we decide to analyse longer n-grams (sequences of n tokens), it could make sense to keep the stop-words, just in case we want to capture phrases like “to be or not to be”.

So after counting and sorting the bigrams, this is the result:

[(('nice', 'article'), 4), (('extractive', 'summarisation'), 4), (('summarisation', 'sentence'), 3), (('short', 'paper'), 3), (('paper', 'extractive'), 2)]

So apparently I tweet about nice articles (I wouldn’t bother sharing the boring ones) and extractive summarisation (the topic of my PhD dissertation). This also sounds about right.

Summary

This article has built on top of the previous ones to discuss some basis for extracting interesting terms from a data set of tweets, by using simple term frequencies, stop-word removal and n-grams. While these approaches are extremely simple to implement, they are quite useful to have a bird’s eye view on the data. We have used some components of NLTK (introduced in a previous article), so we don’t have to re-invent the wheel.

@MarcoBonzanini

Tutorial Table of Contents:

Part 1: Collecting data
Part 2: Text Pre-processing
Part 3: Term Frequencies (this article)
Part 4: Rugby and Term Co-Occurrences
Part 5: Data Visualisation Basics
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps

Mining Twitter Data with Python (Part 2: Text Pre-processing)

This is the second part of a series of articles about data mining on Twitter. In the previous episode, we have seen how to collect data from Twitter. In this post, we’ll discuss the structure of a tweet and we’ll start digging into the processing steps we need for some text analysis.

Table of Contents of this tutorial:

Part 1: Collecting data
Part 2: Text Pre-processing (this article)
Part 3: Term Frequencies
Part 4: Rugby and Term Co-Occurrences
Part 5: Data Visualisation Basics
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps

The Anatomy of a Tweet

Assuming that you have collected a number of tweets and stored them in JSON as suggested in the previous article, let’s have a look at the structure of a tweet:

import json

with open('mytweets.json', 'r') as f:
    line = f.readline() # read only the first tweet/line
    tweet = json.loads(line) # load it as Python dict
    print(json.dumps(tweet, indent=4)) # pretty-print

The key attributes are the following:

text: the text of the tweet itself
created_at: the date of creation
favorite_count, retweet_count: the number of favourites and retweets
favorited, retweeted: boolean stating whether the authenticated user (you) have favourited or retweeted this tweet
lang: acronym for the language (e.g. “en” for english)
id: the tweet identifier
place, coordinates, geo: geo-location information if available
user: the author’s full profile
entities: list of entities like URLs, @-mentions, hashtags and symbols
in_reply_to_user_id: user identifier if the tweet is a reply to a specific user
in_reply_to_status_id: status identifier id the tweet is a reply to a specific status

As you can see there’s a lot of information we can play with. All the *_id fields also have a *_id_str counterpart, where the same information is stored as a string rather than a big int (to avoid overflow problems). We can imagine how these data already allow for some interesting analysis: we can check who is most favourited/retweeted, who’s discussing with who, what are the most popular hashtags and so on. Most of the goodness we’re looking for, i.e. the content of a tweet, is anyway embedded in the text, and that’s where we’re starting our analysis.

We start our analysis by breaking the text down into words. Tokenisation is one of the most basic, yet most important, steps in text analysis. The purpose of tokenisation is to split a stream of text into smaller units called tokens, usually words or phrases. While this is a well understood problem with several out-of-the-box solutions from popular libraries, Twitter data pose some challenges because of the nature of the language.

How to Tokenise a Tweet Text

Let’s see an example, using the popular NLTK library to tokenise a fictitious tweet:

from nltk.tokenize import word_tokenize

tweet = 'RT @marcobonzanini: just an example! :D http://example.com #NLP'
print(word_tokenize(tweet))
# ['RT', '@', 'marcobonzanini', ':', 'just', 'an', 'example', '!', ':', 'D', 'http', ':', '//example.com', '#', 'NLP']

You will notice some peculiarities that are not captured by a general-purpose English tokeniser like the one from NLTK: @-mentions, emoticons, URLs and #hash-tags are not recognised as single tokens. The following code will propose a pre-processing chain that will consider these aspects of the language.

import re

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&amp;+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs

    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
   
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

tweet = 'RT @marcobonzanini: just an example! :D http://example.com #NLP'
print(preprocess(tweet))
# ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', 'http://example.com', '#NLP']

As you can see, @-mentions, emoticons, URLs and #hash-tags are now preserved as individual tokens.

If we want to process all our tweets, previously saved on file:

with open('mytweets.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        do_something_else(tokens)

The tokeniser is probably far from perfect, but it gives you the general idea. The tokenisation is based on regular expressions (regexp), which is a common choice for this type of problem. Some particular types of tokens (e.g. phone numbers or chemical names) will not be captured, and will be probably broken into several tokens. To overcome this problem, as well as to improve the richness of your pre-processing pipeline, you can improve the regular expressions, or even employ more sophisticated techniques like Named Entity Recognition.

The core component of the tokeniser is the regex_str variable, which is a list of possible patterns. In particular, we try to capture some emoticons, HTML tags, Twitter @usernames (@-mentions), Twitter #hashtags, URLs, numbers, words with and without dashes and apostrophes, and finally “anything else”. Please take a moment to observe the regexp for capturing numbers: why don’t we just use \d+? The problem here is that numbers can appear in several different ways, e.g. 1000 can also be written as 1,000 or 1,000.00 — and we can get into more complications in a multi-lingual environment where commas and dots are inverted: “one thousand” can be written as 1.000 or 1.000,00 in many non-anglophone countries. The task of identifying numeric tokens correctly just gives you a glimpse of how difficult tokenisation can be.

The regular expressions are compiled with the flags re.VERBOSE, to allow spaces in the regexp to be ignored (see the multi-line emoticons regexp), and re.IGNORECASE to catch both upper and lowercases. The tokenize() function simply catches all the tokens in a string and returns them as a list. This function is used within preprocess(), which is used as a pre-processing chain: in this case we simply add a lowercasing feature for all the tokens that are not emoticons (e.g. :D doesn’t become :d).

Summary

In this article we have analysed the overall structure of a tweet, and we have discussed how to pre-process the text before we can get into some more interesting analysis. In particular, we have seen how tokenisation, despite being a well-understood problem, can get tricky with Twitter data. The proposed solution is far from perfect but it’s a good starting point, and fairly easy to extend.

@MarcoBonzanini

Table of Contents of this tutorial:

Part 1: Collecting data
Part 2: Text Pre-processing (this article)
Part 3: Term Frequencies
Part 4: Rugby and Term Co-Occurrences
Part 5: Data Visualisation Basics
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps

Mining Twitter Data with Python (Part 1: Collecting data)

Twitter is a popular social network where users can share short SMS-like messages called tweets. Users share thoughts, links and pictures on Twitter, journalists comment on live events, companies promote products and engage with customers. The list of different ways to use Twitter could be really long, and with 500 millions of tweets per day, there’s a lot of data to analyse and to play with.

This is the first in a series of articles dedicated to mining data on Twitter using Python. In this first part, we’ll see different options to collect data from Twitter. Once we have built a data set, in the next episodes we’ll discuss some interesting data applications.

Update July 2016: my new book on data mining for Social Media is out! Part of the content in this tutorial has been improved and expanded as part of the book, so please have a look. Chapter 2 about mining Twitter is available as a free sample from the publisher’s web site, and the companion code with many more examples is available on my GitHub

Table of Contents of this tutorial:

Part 1: Collecting Data (this article)
Part 2: Text Pre-processing
Part 3: Term Frequencies
Part 4: Rugby and Term Co-Occurrences
Part 5: Data Visualisation Basics
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps

More updates: fixed version number of Tweepy to avoid problem with Python 3; fixed discussion on _json to get the JSON representation of a tweet; added example of process_or_store().

Register Your App

In order to have access to Twitter data programmatically, we need to create an app that interacts with the Twitter API.

The first step is the registration of your app. In particular, you need to point your browser to http://apps.twitter.com, log-in to Twitter (if you’re not already logged in) and register a new application. You can now choose a name and a description for your app (for example “Mining Demo” or similar). You will receive a consumer key and a consumer secret: these are application settings that should always be kept private. From the configuration page of your app, you can also require an access token and an access token secret. Similarly to the consumer keys, these strings must also be kept private: they provide the application access to Twitter on behalf of your account. The default permissions are read-only, which is all we need in our case, but if you decide to change your permission to provide writing features in your app, you must negotiate a new access token.

Important Note: there are rate limits in the use of the Twitter API, as well as limitations in case you want to provide a downloadable data-set, see:

Accessing the Data

Twitter provides REST APIs you can use to interact with their service. There is also a bunch of Python-based clients out there that we can use without re-inventing the wheel. In particular, Tweepy in one of the most interesting and straightforward to use, so let’s install it:

pip install tweepy==3.3.0

Update: the release 3.4.0 of Tweepy has introduced a problem with Python 3, currently fixed on github but not yet available with pip, for this reason we’re using version 3.3.0 until a new release is available.

More Updates: the release 3.5.0 of Tweepy, already available via pip, seems to solve the problem with Python 3 mentioned above.

In order to authorise our app to access Twitter on our behalf, we need to use the OAuth interface:

import tweepy
from tweepy import OAuthHandler

consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

The api variable is now our entry point for most of the operations we can perform with Twitter.

For example, we can read our own timeline (i.e. our Twitter homepage) with:

for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(status.text)

Tweepy provides the convenient Cursor interface to iterate through different types of objects. In the example above we’re using 10 to limit the number of tweets we’re reading, but we can of course access more. The status variable is an instance of the Status() class, a nice wrapper to access the data. The JSON response from the Twitter API is available in the attribute _json (with a leading underscore), which is not the raw JSON string, but a dictionary.

So the code above can be re-written to process/store the JSON:

for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    process_or_store(status._json)

What if we want to have a list of all our followers? There you go:

for friend in tweepy.Cursor(api.friends).items():
    process_or_store(friend._json)

And how about a list of all our tweets? Simple:

for tweet in tweepy.Cursor(api.user_timeline).items():
    process_or_store(tweet._json)

In this way we can easily collect tweets (and more) and store them in the original JSON format, fairly easy to convert into different data models depending on our storage (many NoSQL technologies provide some bulk import feature).

The function process_or_store() is a place-holder for your custom implementation. In the simplest form, you could just print out the JSON, one tweet per line:

def process_or_store(tweet):
    print(json.dumps(tweet))

Streaming

In case we want to “keep the connection open”, and gather all the upcoming tweets about a particular event, the streaming API is what we need. We need to extend the StreamListener() to customise the way we process the incoming data. A working example that gathers all the new tweets with the #python hashtag:

from tweepy import Stream
from tweepy.streaming import StreamListener

class MyListener(StreamListener):

    def on_data(self, data):
        try:
            with open('python.json', 'a') as f:
                f.write(data)
                return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        print(status)
        return True

twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['#python'])

Depending on the search term, we can gather tons of tweets within a few minutes. This is especially true for live events with a world-wide coverage (World Cups, Super Bowls, Academy Awards, you name it), so keep an eye on the JSON file to understand how fast it grows and consider how many tweets you might need for your tests. The above script will save each tweet on a new line, so you can use the command wc -l python.json from a Unix shell to know how many tweets you’ve gathered.

You can see a minimal working example of the Twitter Stream API in the following Gist:

twitter_stream_downloader.py

Summary

We have introduced tweepy as a tool to access Twitter data in a fairly easy way with Python. There are different types of data we can collect, with the obvious focus on the “tweet” object.

Once we have collected some data, the possibilities in terms of analytics applications are endless. In the next episodes, we’ll discuss some options.

@MarcoBonzanini

Table of Contents of this tutorial:

Part 1: Collecting Data (this article)
Part 2: Text Pre-processing
Part 3: Term Frequencies
Part 4: Rugby and Term Co-Occurrences
Part 5: Data Visualisation Basics
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps

Fuzzy String Matching in Python

Fuzzy String Matching, also called Approximate String Matching, is the process of finding strings that approximatively match a given pattern.
The closeness of a match is often measured in terms of edit distance, which is the number of primitive operations necessary to convert the string into an exact match.
Primitive operations are usually: insertion (to insert a new character at a given position), deletion (to delete a particular character) and substitution (to replace a character with a new one).

Fuzzy String Matching can have different practical applications. Typical examples are spell-checking, text re-use detection (the politically correct way of calling plagiarism detection), spam filtering, as well as several applications in the bioinformatics domain, e.g. matching DNA sequences.

This article plays around with fuzzywuzzy, a Python library for Fuzzy String Matching.

Getting Started with Fuzzywuzzy

FuzzyWuzzy has been developed and open-sourced by SeatGeek, a service to find sport and concert tickets. Their original use case, as discussed in their blog, was the problem given by the many different ways of labelling the same event, adding or hiding location, dates, venue, etc. This problem is also arising with different entities like persons or companies.

To install the library, you can use pip as usual:

pip install fuzzywuzzy

The main modules in FuzzyWuzzy are called fuzz, for string-to-string comparisons, and process to compare a string with a list of strings.

Under the hood, FuzzyWuzzy uses difflib, part of the standard library, so there is nothing extra to install. We can anyway benefit from the performance of python-Levenshtein for sequence matching, so let’s also install this library:

pip install python-Levenshtein

Examples of Usage

Firstly, let’s import the main modules:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In order to calculate a similarity score between two strings, we can use the methods ratio() or partial_ratio():

fuzz.ratio("ACME Factory", "ACME Factory Inc.")
# 83
fuzz.partial_ratio("ACME Factory", "ACME Factory Inc.")
# 100

We can see how the ratio() function is confused by the suffix “Inc.” used in company names, but really the two strings refer to the same entity. This is captured by the partial ratio.

More examples:

fuzz.ratio('Barack Obama', 'Barack H. Obama')
# 89
fuzz.partial_ratio('Barack Obama', 'Barack H. Obama')
# 75

fuzz.ratio('Barack H Obama', 'Barack H. Obama')
# 97
fuzz.partial_ratio('Barack H Obama', 'Barack H. Obama')
# 92

Here we observe the opposite behaviour: different variations in Barack Obama’s name produce a lower score for the partial ratio, why is that? Probably because the extra token for the middle name is right in the middle of the string. For this particular case, we can benefit by other functions that tokenise the string and treat it as a set or as a sequence of words:

fuzz.token_sort_ratio('Barack Obama', 'Barack H. Obama')
# 92
fuzz.token_set_ratio('Barack Obama', 'Barack H. Obama')
# 100

fuzz.token_sort_ratio('Barack H Obama', 'Barack H. Obama')
# 100
fuzz.token_set_ratio('Barack H Obama', 'Barack H. Obama')
# 100

The token_* functions split the string on white-spaces, lowercase everything and get rid of non-alpha non-numeric characters, which means punctuation is ignored (as well as weird unicode symbols).

In case we have a list of options and we want to find the closest match(es), we can use the process module:

query = 'Barack Obama'
choices = ['Barack H Obama', 'Barack H. Obama', 'B. Obama']
# Get a list of matches ordered by score, default limit to 5
process.extract(query, choices)
# [('Barack H Obama', 95), ('Barack H. Obama', 95), ('B. Obama', 85)]

# If we want only the top one
process.extractOne(query, choices)
# ('Barack H Obama', 95)

Summary

This article has introduced Fuzzy String Matching, which is a well understood problem with some interesting practical applications.

Python has a very simple option to tackle the problem: the FuzzyWuzzy library, which is built on top of difflib (and python-Levenshtein for speed). It can take a while to figure out how to scope our string matching problem, but the easy interface of fuzzywuzzy should help speeding up the development.

Phrase Match and Proximity Search in Elasticsearch

The case of multi-term queries in Elasticsearch offers some room for discussion, because there are several options to consider depending on the specific use case we’re dealing with.

Multi-term queries are, in their most generic definition, queries with several terms. These terms could be completely unrelated, or they could be about the same topic, or they could even be part of a single specific concept. All these scenarios call for different configurations. This articles discusses some of the options with Elasticsearch.

Sample Documents

Assuming elasticsearch is up and running on your local machine, you can download the script which creates the data set used in the following examples. These are the created documents:

Doc ID	Content
1	This is a brown fox
2	This is a brown dog
3	This dog is really brown
4	The dog is brown but this document is very very long
5	There is also a white cat
6	The quick brown fox jumps over the lazy dog

Notice that for the sake of these examples, we’re using the default configuration which means using the default TF-IDF scoring function from Lucene, that include some score normalisation based on the document length (shorter documents are promoted).

In order to run all the following queries, the basic option is to use curl, e.g.:

curl -XPOST http://localhost:9200/test/articles/_search?pretty=true -d '{THE QUERY CODE HERE}'

although one could embed the query in Python code as discussed in a previous post (or in any programming language that allows to do some REST calls). With the pretty=true parameter, the JSON output will be more readable on the shell.

A General Purpose Query

The first example of query is for the simple case of terms which may or may not be related. In this scenario, we decide to use a classic match query. This kind of query does not impose any restriction between multiple terms, but of course will promote documents which contain more query terms.

{
    "query": {
        "match": {
            "content": {
                "query": "quick brown dog"
            }
        }
     }
}

This query retrieves 5 documents, in this order:

Pos	Doc ID	Content	Score
1	6	The quick brown fox jumps over the lazy dog	0.81502354
2	2	This is a brown dog	0.26816052
3	3	This dog is really brown	0.26816052
4	4	The dog is brown but this document is very very long	0.15323459
5	1	This is a brown fox	0.055916067

We can notice how the first document has a much higher score, because it’s the only one containing all the query terms. The documents in position 2 and 3 share the same score, because they have the same number of matches (two terms) and the same document lenght. The document in fourth position instead, despite having the same number of matches as the previous two, has a lower score because it’s much longer, so the document length normalisation penalises it. We can also notice how the last document has a very low score and it’s probably irrelevant.

Precision on multi-term query can be controlled by specifying some arbitrary threashold for the number of terms which should be matched. For example, we can re-write the query as:

{
    "query": {
        "match": {
            "content": {
                "query": "quick brown dog",
                "minimum_should_match": 75%
            }
        }
     }
}

The output will be basically the same, with the exeption of having only the top four documents. This is because the fifth document, “This is a brown fox”, only matches 1/3 of the query terms, which is below 75%. You can experiment with different thresholds for minimum match, keeping in mind that there is a balance to find between removing unrelevant documents and not losing the relevant ones.

The Case of Phrase Matching

In the previous example, the query terms were completely unrelated, so the query “quick brown dog” also retrieved brown foxes and non-quick dogs. What if we need an exact match of the query? More precisely, what if we need to match all the query terms in their relative position? This is the case for named entities like “New York”, where the two terms individually don’t convey the same meaning as the two of them concatenated in this order.

Elasticsearch has an option for this: match_phrase. The previous query can be rewritten as:

{
    "query": {
        "match_phrase": {
            "content": "quick brown dog"
        }
    }
}

We immediately see that the query returns an empty result set: there is no document about quick brown dogs. Let’s re-write the query in a less restrictive way, dropping the “quick” term:

{
    "query": {
        "match_phrase": {
            "content": "brown dog"
        }
    }
}

Now we can see how the query retrieves only one document, precisely document 2, the only one to match the exact phrase.

Proximity Search

Sometimes a phrase match can be too restrictive. What if we’re not really interested in a precise match, but we’d rather retrieve documents where the query terms occur somehow close to each other. This is an example of proximity search: the order of the terms doesn’t really matter, as long as they occur somehow within the same context. This concept is less restrictive than a pure phrase match, but still stronger than a general purpose query.

In order to achieve proximity search, we simply need to define the search window, so how far we allow the terms to be. This is called slop in Elasticsearch/Lucene terminology. The change to the previous code is really minimal, for example for a slop/window of 3 terms:

{
    "query": {
        "match_phrase": {
            "content": {
                "query": "brown dog",
                "slop": 3
            }
        }
    }
}

The result of the query:

Pos	Doc ID	Content	Score
1	2	This is a brown dog	0.9547657
2	4	The dog is brown but this document is very very long	0.2727902

We immediately see that the second document is also relevant to the query, but it was missed by the original phrase match. We can also try with a bigger slop, e.g.:

{
    "query": {
        "match_phrase": {
            "content": {
                "query": "brown dog",
                "slop": 4
            }
        }
    }
}

which retrieves the following:

Pos	Doc ID	Content	Score
1	2	This is a brown dog	0.9547657
2	3	This dog is really brown	0.4269842
3	4	The dog is brown but this document is very very long	0.2727902

The new result, document 3, is also still relevant. On the other side, if we keep increasing the slop, at some point we will end up including non-relevant, or less relevant, results. It’s hence important to understand the needs of the specific scenario and find a balance between not missing relevant results and including non-relevant results.

Within-Sentence Proximity Search

A variation of the proximity search discussed above consists in the need to match terms occurring in a specific context. Such a context could be the same sentence, the same paragraph, the same section, etc. The difference with what we already discussed in the previous paragraph is that here we might have a specific structure (sections, sentences, …) but not a specific window/slop size in mind.

Let’s assume the “content” field of our documents is a list of sentences, so we want to perform proximity search within a sentence. An example of document with two sentences:

{
    "content": ["This is a brown fox", "This is white dog"]
}

The trick to allow within-sentence search is to define a slop which is big enough to capture the sentence length, and to use it as a position offset:

{
    "properties": {
        "content": {
            "type": "string",
            "position_offset_gap": 100
        }
    }
}

Here the value 100 is arbitrary and is “big enough”. Pushing this configuration to our index, we force the terms to jump 100 positions ahead when there is a new sentence. In the previous document, if the term “fox” is in position 5, the following term “This” will be in position 106 rather than 6, because it’s in a new sentence. You can download the full script to implement sentence-based proximity search, with the updated documents to reflect the sentence structure, keeping in mind that applying this option to an existing data set requires re-indexing.

The value of the position offset can now be used as slop value:

{
    "query": {
        "match_phrase": {
            "content": {
                "query": "brown dog",
                "slop": 100
            }
        }
    }
}

This query will return four documents. More precisely, document 1, which mentions a white dog and a brown fox, will not be retrieved, because the two terms appear in different sentences.

Summary

We have explored some options from Elasticsearch to improve the results for queries with multiple terms. We have seen how Elasticsearch provides these functionality in a fairly easy way. The starting point is to understand the specific use case that we’re trying to tackle, and from here we have a set of choices. Depending on the scenario, we might want to choose one between:

a simple match search
a match search with a minimum match ratio
a phrase-based match
a phrase match with a slop for proximity search
a phrase match with a slop which matches the position offset specified in the index, for sentence-based (or any other context-based) proximity search

PyData London Meetup 2015-02-03

A quick update with my impressions on the last PyData London meet-up I attended this evening.

I missed the past couple of meet-ups because of some clash on my personal schedule, so this was my first time in the new venue, still close to Old Street, hosted by Lyst. A nice big room to accomodate many many people (more than 200 probably?), and a nice selection of craft beers.

We started with the usual initial community-related announcements, including the impressive achievement of the group: 1,000+ members on the meetup.com dedicated page! Well done guys!

The core topic of the evening was data visualisation. For this reason, the main candidate for Python Module Of The Month was (… drum roll …) matplotlib :-D

The first talk, “Thinking about data visualisation”, was given by Andy Kirk from Visualising Data. Thinking was an important keyword of the presentation: after an initial comparison between talent (e.g. artistic, creative) vs thinking, Andy went on discussing different aspects of thinking and how our thinking can provide more value to the visualisation we are proposing. Long story short, you don’t have to be an artist to create useful visualisation, assuming you take the context and the overall story into account.

The second talk, “Lies, damned lies and dataviz”, was given by Andrew Clegg from Etsy. Andrew started the presentation with some argument in support of providing good dataviz, and then he continued with a great sequence of examples of bad dataviz, discussing how some of them simply confuse the user without providing any additional insight, while others really trick the user into wrong conclusions. Whether these are damned lies or just bad design choices… well that’s another story.

Both presentations were really great, and both of them technology-agnostic, which is something I enjoyed for some reason.

During the final lightning-and-community-talks, the interesting news was the announcement of a new O’Reilly book on data visualisation with Python and Javascript (link to come).

Overall a very nice evening, congratulations to all the organisers.

PyData London meetup.com group:
http://www.meetup.com/PyData-London-Meetup/

How to Query Elasticsearch with Python

Elasticsearch is an open-source distributed search server built on top of Apache Lucene. It’s a great tool that allows to quickly build applications with full-text search capabilities. The core implementation is in Java, but it provides a nice REST interface which allows to interact with Elasticsearch from any programming language.

This article provides an overview on how to query Elasticsearch from Python. There are two main options:

Implement the REST-API calls to Elasticsearch
Use one of the Python libraries that does the above for you

Quick Intro on Elasticsearch

Elasticsearch is developed in Java on top of Lucene, but the format for configuring the index and querying the server is JSON. Once the server is running, by default it’s accessible at localhost:9200 and we can start sending our commands via e.g. curl:

curl -XPOST http://localhost:9200/test/articles/1 -d '{
    "content": "The quick brown fox"
}'

This commands creates a new document, and since the index didn’t exist, it also creates the index. Specifically, the format for the URL is:

http://hostname:port/index_name/doc_type/doc_id

so we have just created an index “test” which contains documents of type “articles”. The document has only one field, “content”. Since we didn’t specify, the content is indexed using the default Lucene analyzer (which is usually a good choice for standard English). The document id is optional and if we don’t explicitly give one, the server will create a random hash-like one.

We can insert a few more documents, see for example the file create_index.sh from the code snippets on github.

Once the documents are indexed, we can perform a simple search, e.g.:

curl -XPOST http://localhost:9200/test/articles/_search?pretty=true -d '{
    "query": {
        "match": {
            "content": "dog"
        }
    }
}'

Using the sample documents above, this query should return only one document. Performing the same query over the term “fox” rather than “dog” should give instead four documents, ranked according to their relevance.

How the Elasticsearch/Lucene ranking function works, and all the countless configuration options for Elasticsearch, are not the focus of this article, so bear with me if we’re not digging into the details. For the moment, we’ll just focus on how to integrate/query Elasticsearch from our Python application.

Querying Elasticsearch via REST in Python

One of the option for querying Elasticsearch from Python is to create the REST calls for the search API and process the results afterwards. The requests library is particularly easy to use for this purpose. We can install it with:

pip install requests

The sample query used in the previous section can be easily embedded in a function:

def search(uri, term):
    """Simple Elasticsearch Query"""
    query = json.dumps({
        "query": {
            "match": {
                "content": term
            }
        }
    })
    response = requests.get(uri, data=query)
    results = json.loads(response.text)
    return results

The “results” variable will be a dictionary loaded from the JSON response. We can pretty-print the JSON, to observe the full output and understand all the information it provides, but again this is beyond the scope of this post. So we can simply print the results nicely, one document per line, as follows:

def format_results(results):
    """Print results nicely:
    doc_id) content
    """
    data = [doc for doc in results['hits']['hits']]
    for doc in data:
        print("%s) %s" % (doc['_id'], doc['_source']['content']))

Similarly, we can create new documents:

def create_doc(uri, doc_data={}):
    """Create new document."""
    query = json.dumps(doc_data)
    response = requests.post(uri, data=query)
    print(response)

with the doc_data variable being a (Python) dictionary which resembles the structure of the document we’re creating.

You can see a full working toy example in the rest.py file in the Gist on github.

Querying Elasticsearch Using elasticsearch-py

The requests library is fairly easy to use, but there are several options in terms of libraries that abstract away the concepts related to the REST API and focus on Elasticsearch concepts. In particular, the official Python extension for Elasticsearch, called elasticsearch-py, can be installed with:

pip install elasticsearch

It’s fairly low-level compared to other client libraries with similar capabilities, but it provides a consistent and easy to extend API.

We can replicate the search used with the requests library, as well as the result print-out, just using a few lines of Python:

from elasticsearch import Elasticsearch

es = Elasticsearch()
res = es.search(index="test", doc_type="articles", body={"query": {"match": {"content": "fox"}}})
print("%d documents found" % res['hits']['total'])
for doc in res['hits']['hits']:
    print("%s) %s" % (doc['_id'], doc['_source']['content']))

In a similar fashion, we can re-create the functionality of adding an extra document:

es.create(index="test", doc_type="articles", body={"content": "One more fox"})

The full functionality of this client library are well described in the documentation.

Summary

This article has briefly discussed a couple of options to integrate Elasticsearch into a Python application. The key points of the discussion are:

We can interact with Elasticsearch using the REST API
The requests library is particularly useful for this purpose, and probably much cleaner and easier to use than the urllib module (part of the standard library)
Many other Python libraries implement an Elasticsearch client, abstracting away the concept related to the REST API and focusing on Elasticsearch concepts
We have seen simple examples with elasticsearch-py

The full code for the examples is available as usual in a Gist:
https://gist.github.com/bonzanini/fe2ff32116f16e3009be

Stemming, Lemmatisation and POS-tagging with Python and NLTK

This article describes some pre-processing steps that are commonly used in Information Retrieval (IR), Natural Language Processing (NLP) and text analytics applications.

In particular, the focus is on the comparison between stemming and lemmatisation, and the need for part-of-speech tagging in this context. The discussion shows some examples in NLTK, also as
Gist on github.

Stemming

Stemming is the process of reducing a word into its stem, i.e. its root form. The root form is not necessarily a word by itself, but it can be used to generate words by concatenating the right suffix.

For example, the words fish, fishes and fishing all stem into fish, which is a correct word. On the other side, the words study, studies and studying stems into studi, which is not an English word.

Most commonly, stemming algorithms (a.k.a. stemmers) are based on rules for suffix stripping.
The most famous example is the Porter stemmer, introduced in the 1980’s and currently implemented in a variety of programming languages.

Traditionally, search engines and other IR applications have applied stemming to improve the chance of matching different forms of a word, almost treating them like synonyms, as conceptually they “belong” together.

Lemmatisation

The purpose of Lemmatisation is to group together different inflected forms of a word, called lemma. The process is somehow similar to stemming, as it maps several words into one common root. The output of lemmatisation is a proper word, and basic suffix stripping wouldn’t provide the same outcome. For example, a lemmatiser should map gone, going and went into go. In order to achieve its purpose, lemmatisation requires to know about the context of a word, because the process relies on whether the word is a noun, a verb, etc.

Part-of-speech Tagging

Part-of-speech (POS) tagging is the process of assigning a word to its grammatical category, in order to understand its role within the sentence. Traditional parts of speech are nouns, verbs, adverbs, conjunctions, etc.

Part-of-speech taggers typically take a sequence of words (i.e. a sentence) as input, and provide a list of tuples as output, where each word is associated with the related tag.

Part-of-speech tagging is what provides the contextual information that a lemmatiser needs to choose the appropriate lemma.

Examples in Python and NLTK

One of the most popular packages for NLP in Python is the Natural Language Toolkit (NLTK). It includes several tools for text analytics, as well as training data for some of the tools, and also some well-known data sets.

To install NLTK:

pip install nltk

In order to install the additional data, you can use its internal tool. From a Python interactive shell, simply type:

import nltk
nltk.download()

This will open a GUI which you can use to choose which data you want to download (if you’re not using a GUI environment, the interface will be textual). In a dev environment, I normally just download all the data for all the packages in the default folder ($HOME/nltk_data) but you can personalise
your installation.

A full example of stemming, lemmatisation and POS-tagging is available as Gist on github.

Let’s focus on this snippet:

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

print("Stem %s: %s" % ("studying", stemmer.stem("studying")))
print("Lemmatise %s: %s" % ("studying", lemmatiser.lemmatize("studying")))
print("Lemmatise %s: %s" % ("studying", lemmatiser.lemmatize("studying", pos="v")))

The output will be:

Stem studying: studi
Lemmatise studying: studying
Lemmatise studying: study

We can observe that the stemming process doesn’t generate a real word, but a root form.
On the other side, the lemmatiser generates real words, but without contextual information it’s not able to distinguish between nouns and verbs, hence the lemmatisation process doesn’t change
the word. The context is provided by the POS tag (“v” for verb in this example).

In order to generate POS tags automatically, nltk comes with a simple function. The snippet for POS tagging:

from nltk import pos_tag
from nltk.tokenize import word_tokenize

s = "This is a simple sentence"
tokens = word_tokenize(s) # Generate list of tokens
tokens_pos = pos_tag(tokens) 

print(tokens_pos)

and the output will be:

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('simple', 'JJ'), ('sentence', 'NN')]

NLTK uses the set of tags from the Penn Treebank project.

Summary

Stemming, lemmatisation and POS-tagging are important pre-processing steps in many text analytics applications. You can get up and running very quickly and include these capabilities in your Python applications by using the off-the-shelf solutions in offered by NLTK.