## Mining Twitter Data with Python (Part 6 – Sentiment Analysis Basics)

Sentiment Analysis is one of the interesting applications of text analytics. Although the term is often associated with sentiment classification of documents, broadly speaking it refers to the use of text analytics approaches applied to the set of problems related to identifying and extracting subjective material in text sources.

This article continues the series on mining Twitter data with Python, describing a simple approach for Sentiment Analysis and applying it to the rubgy data set (see Part 4).

## A Simple Approach for Sentiment Analysis

The technique we’re discussing in this post has been elaborated from the traditional approach proposed by Peter Turney in his paper Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. A lot of work has been done in Sentiment Analysis since then, but the approach has still an interesting educational value. In particular, it is intuitive, simple to understand and to test, and most of all unsupervised, so it doesn’t require any labelled data for training.

Firstly, we define the Semantic Orientation (SO) of a word as the difference between its associations with positive and negative words. In practice, we want to calculate “how close” a word is with terms like good and bad. The chosen measure of “closeness” is Pointwise Mutual Information (PMI), calculated as follows (t1 and t2 are terms):

$\mbox{PMI}(t_1, t_2) = \log\Bigl(\frac{P(t_1 \wedge t_2)}{P(t_1) \cdot P(t_2)}\Bigr)$

In Turney’s paper, the SO of a word was calculated against excellent and poor, but of course we can extend the vocabulary of positive and negative terms. Using $V^{+}$ and a vocabulary of positive terms and $V^{-}$ for the negative ones, the Semantic Orientation of a term t is hence defined as:

$\mbox{SO}(t) = \sum_{t' \in V^{+}}\mbox{PMI}(t, t') - \sum_{t' \in V^{-}}\mbox{PMI}(t, t')$

We can build our own list of positive and negative terms, or we can use one of the many resources available on-line, for example the opinion lexicon by Bing Liu.

## Computing Term Probabilities

In order to compute $P(t)$ (the probability of observing the term t) and $P(t_1 \wedge t_2)$ (the probability of observing the terms t1 and t2 occurring together) we can re-use some previous code to calculate term frequencies and term co-occurrences. Given the set of documents (tweets) D, we define the Document Frequency (DF) of a term as the number of documents where the term occurs. The same definition can be applied to co-occurrent terms. Hence, we can define our probabilities as:

$P(t) = \frac{\mbox{DF}(t)}{|D|}\\ P(t_1 \wedge t_2) = \frac{\mbox{DF}(t_1 \wedge t_2)}{|D|}$

In the previous articles, the document frequency for single terms was stored in the dictionaries count_single and count_stop_single (the latter doesn’t store stop-words), while the document frequency for the co-occurrencies was stored in the co-occurrence matrix com

This is how we can compute the probabilities:

# n_docs is the total n. of tweets
p_t = {}
p_t_com = defaultdict(lambda : defaultdict(int))

for term, n in count_stop_single.items():
p_t[term] = n / n_docs
for t2 in com[term]:
p_t_com[term][t2] = com[term][t2] / n_docs


## Computing the Semantic Orientation

Given two vocabularies for positive and negative terms:

positive_vocab = [
'good', 'nice', 'great', 'awesome', 'outstanding',
'fantastic', 'terrific', ':)', ':-)', 'like', 'love',
# shall we also include game-specific terms?
# 'triumph', 'triumphal', 'triumphant', 'victory', etc.
]
negative_vocab = [
'bad', 'terrible', 'crap', 'useless', 'hate', ':(', ':-(',
# 'defeat', etc.
]


We can compute the PMI of each pair of terms, and then compute the
Semantic Orientation as described above:

pmi = defaultdict(lambda : defaultdict(int))
for t1 in p_t:
for t2 in com[t1]:
denom = p_t[t1] * p_t[t2]
pmi[t1][t2] = math.log2(p_t_com[t1][t2] / denom)

semantic_orientation = {}
for term, n in p_t.items():
positive_assoc = sum(pmi[term][tx] for tx in positive_vocab)
negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)
semantic_orientation[term] = positive_assoc - negative_assoc


The Semantic Orientation of a term will have a positive (negative) value if the term is often associated with terms in the positive (negative) vocabulary. The value will be zero for neutral terms, e.g. the PMI’s for positive and negative balance out, or more likely a term is never observed together with other terms in the positive/negative vocabularies.

We can print out the semantic orientation for some terms:

semantic_sorted = sorted(semantic_orientation.items(),
key=operator.itemgetter(1),
reverse=True)
top_pos = semantic_sorted[:10]
top_neg = semantic_sorted[-10:]

print(top_pos)
print(top_neg)
print("ITA v WAL: %f" % semantic_orientation['#itavwal'])
print("SCO v IRE: %f" % semantic_orientation['#scovire'])
print("ENG v FRA: %f" % semantic_orientation['#engvfra'])
print("#ITA: %f" % semantic_orientation['#ita'])
print("#FRA: %f" % semantic_orientation['#fra'])
print("#SCO: %f" % semantic_orientation['#sco'])
print("#ENG: %f" % semantic_orientation['#eng'])
print("#WAL: %f" % semantic_orientation['#wal'])
print("#IRE: %f" % semantic_orientation['#ire'])


Different vocabularies will produce different scores. Using the opinion lexicon from Bing Liu, this is what we can observed on the Rugby data-set:

# the top positive terms
[('fantastic', 91.39950482011552), ('@dai_bach', 90.48767241244532), ('hoping', 80.50247748725415), ('#it', 71.28333427277785), ('days', 67.4394844955977), ('@nigelrefowens', 64.86112716005566), ('afternoon', 64.05064208341855), ('breathtaking', 62.86591435212975), ('#wal', 60.07283361352875), ('annual', 58.95378954406133)]
# the top negative terms
[('#england', -74.83306534609066), ('6', -76.0687215594536), ('#itavwal', -78.4558633116863), ('@rbs_6_nations', -80.89363516601993), ("can't", -81.75379628180468), ('like', -83.9319149443813), ('10', -85.93073078165587), ('italy', -86.94465165178258), ('#engvfra', -113.26188957010228), ('ball', -161.82146824640125)]
# Matches
ITA v WAL: -78.455863
SCO v IRE: -73.487661
ENG v FRA: -113.261890
# Individual team
#ITA: 53.033824
#FRA: 14.099372
#SCO: 4.426723
#ENG: -0.462845
#WAL: 60.072834
#IRE: 19.231722

## Some Limitations

The PMI-based approach has been introduced as simple and intuitive, but of course it has some limitations. The semantic scores are calculated on terms, meaning that there is no notion of “entity” or “concept” or “event”. For example, it would be nice to aggregate and normalise all the references to the team names, e.g. #ita, Italy and Italia should all contribute to the semantic orientation of the same entity. Moreover, do the opinions on the individual teams also contribute to the overall opinion on a match?

Some aspects of natural language are also not captured by this approach, more notably modifiers and negation: how do we deal with phrases like not bad (this is the opposite of just bad) or very good (this is stronger than just good)?

## Summary

This article has continued the tutorial on mining Twitter data with Python introducing a simple approach for Sentiment Analysis, based on the computation of a semantic orientation score which tells us whether a term is more closely related to a positive or negative vocabulary. The intuition behind this approach is fairly simple, and it can be implemented using Pointwise Mutual Information as a measure of association. The approach has of course some limitations, but it’s a good starting point to get familiar with Sentiment Analysis.

@MarcoBonzanini

## How to Promote Recent Articles in Elasticsearch

The default ranking options in Elasticsearch (and Lucene) are purely based on content. Sometimes, it is useful to mix other factors in, such as the location/distance of a restaurant, to promote venues close to the user’s location, or the publication date of an article, to promote recent articles.

This post discusses a simple solution which is already integrated in Elasticsearch and can be enabled at query time: a decay function.

## Default Ranking in Elasticsearch

The default ranking function (called similarity in Lucene terminology) is a normalised version of TF-IDF. The normalisation is based on document length and promotes shorter documents over longer ones. Informally, the idea behind this approach is that longer documents will have higher TF’s just because they have more content, so something should be done to avoid penalising short documents. In practice, this works well most of the time, but in case this normalised TF-IDF is not what you need, there are plenty of other options, including BM25, Divergence From Randomness or Language Model. All these ranking functions are based on the content of a field, i.e. they are based on term statistics.

To showcase how the default ranking works, and how we can affect it later, I’ll use a sample database of articles I’ve published on this blog (click here for the Gist). The Gist will create an index on a local instance of Elasticsearch, where each document will have the fields title (string) and published (date).

Once the index is created, we can run a very basic query:

{
"query": {
"match": {
"title": "python"
}
}
}

The query will look for the term python in the title field, and you should see how the documents ranking higher are the ones with shorter titles. This is because the term python occurs only once in each title, so what makes the difference in terms of scoring is the document length normalisation.

## Function Score and Decay Functions

Elasticsearch provides an extensive support for custom scoring via the query DSL, meaning that relevance can be tweaked at query time without re-indexing. Using a function_score query, we can smooth the default scoring function to include freshness (a.k.a. recency) as a component for relevance.

Specifically, the use of some scoring functions called decay functions (namely gaussian, exponential and linear) is the key to integrate numeric fields, date fields and geo fields in the picture.

As an example, let’s consider the following query:

{
"query": {
"function_score": {
"query": {
"match": { "title": "python"}
},
"gauss": {
"published": {
"scale":  "8w"
}
}
}
}
}

This query produces the same result set as the previous one, but the ranking will be very different. In particular, recent articles will be pushed on top. The scale parameter is what governs how quickly the score drops when moving away from the origin. In this example, we have used a range of 8 weeks, meaning that articles older than that will have a very low score.

## Summary

Content is the key aspect to consider in most ranking problems which involve textual data. Sometimes though, content is not the only aspect to consider. This article has showcased decay functions as a simple Elasticsearch option to influence ranking by including freshness/recency as a core component to be mixed with content. Since decay functions work on numeric fields, date fields and geo-points, they can be used to integrate location/distance, recency and other numerical aspects like popularity (e.g. number of votes) into the ranking function. This can be done at query time, without the need to re-index.