This is the third part in a series of articles about data mining on Twitter. After collecting data and pre-processing some text, we are ready for some basic analysis. In this article, we’ll discuss the analysis of term frequencies to extract meaningful terms from our tweets.
Tutorial Table of Contents:
- Part 1: Collecting data
- Part 2: Text Pre-processing
- Part 3: Term Frequencies (this article)
- Part 4: Rugby and Term Co-Occurrences
- Part 5: Data Visualisation Basics
- Part 6: Sentiment Analysis Basics
- Part 7: Geolocation and Interactive Maps
Assuming we have collected a list of tweets (see Part 1 of the tutorial), the first exploratory analysis that we can perform is a simple word count. In this way, we can observe which terms are most commonly used in the data set. In this example, I’ll use the set of my own tweets, so the most frequent words should correspond to the topics I discuss (not necessarily, but bear with me for a couple of paragraphs).
We can use a custom tokeniser to split the tweets into a list of terms. The following code uses the preprocess() function described in Part 2 of the tutorial, in order to capture Twitter-specific aspects of the text, such as #hashtags, @-mentions, emoticons and URLs. In order to keep track of the frequencies while we are processing the tweets, we can use collections.Counter() which internally is a dictionary (term: count) with some useful methods like most_common():
    import operator
    import json
    from collections import Counter

    fname = 'mytweets.json'
    with open(fname, 'r') as f:
        count_all = Counter()
        for line in f:
            tweet = json.loads(line)
            # Create a list with all the terms
            terms_all = [term for term in preprocess(tweet['text'])]
            # Update the counter
            count_all.update(terms_all)
        # Print the first 5 most frequent words
        print(count_all.most_common(5))
The above code will produce some unimpressive results:
[(':', 44), ('rt', 26), ('to', 26), ('and', 25), ('on', 22)]
As you can see, the most frequent words (or should I say, tokens), are not exactly meaningful.
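To make the counting step easy to try without a tweet file, here is a minimal, self-contained sketch. The `simple_tokenise()` helper and the sample tweets are assumptions made up for illustration; the helper is only a crude stand-in for the `preprocess()` function from Part 2, not a replacement for it.

```python
from collections import Counter

# Made-up sample data, standing in for the 'text' field of collected tweets
sample_tweets = [
    "RT @user: Python is great for data mining",
    "Mining data on Twitter with Python",
    "RT @user: more data , more fun",
]


def simple_tokenise(text):
    """Hypothetical stand-in for preprocess(): lowercase, split on whitespace."""
    return text.lower().split()


count_all = Counter()
for text in sample_tweets:
    # update() adds one to the count of every term in the iterable
    count_all.update(simple_tokenise(text))

# most_common(n) returns the n (term, count) pairs with the highest counts
top_terms = count_all.most_common(3)
print(top_terms)
```

Even on this toy input the top of the list contains tokens like `rt`, which is the same problem we see on the real data.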
In every language, some words are particularly common. While their use in the language is crucial, they don’t usually convey a particular meaning, especially if taken out of context. This is the case for articles, conjunctions, some adverbs, etc., which are commonly called stop-words. In the example above, we can see three common stop-words: to, and and on. Stop-word removal is an important step that should be considered during the pre-processing stages. One can build a custom list of stop-words, or use available lists (e.g. NLTK provides a simple list of English stop-words).
Given the nature of our data and our tokenisation, we should also be careful with all the punctuation marks and with terms like RT (used for re-tweets) and via (used to mention the original author of an article or a re-tweet), which are not in the default stop-word list.
    from nltk.corpus import stopwords
    import string

    punctuation = list(string.punctuation)
    stop = stopwords.words('english') + punctuation + ['rt', 'via']
We can now substitute the variable terms_all in the first example with something like:
terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]
After counting, sorting the terms and printing the top 5, this is the result:
[('python', 11), ('@miguelmalvarez', 9), ('#python', 9), ('data', 8), ('@danielasfregola', 7)]
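The filtering step can be tried in isolation with the sketch below. The short `english_stop` list and the sample tokens are assumptions, used only so the example runs without downloading the NLTK corpus; in the article the list comes from `stopwords.words('english')`. The sketch also stores the stop-words in a set rather than a list, which is a small variation on the code above: membership tests on a set don't need to scan the whole collection.

```python
import string
from collections import Counter

# Tiny hand-written stop-word list (an assumption for this example only);
# the article uses stopwords.words('english') from NLTK instead
english_stop = ['to', 'and', 'on', 'the', 'is', 'for', 'with']

# Same ingredients as before, combined into a set for fast lookups
stop = set(english_stop) | set(string.punctuation) | {'rt', 'via'}

# Made-up tweet, already tokenised and lowercased
tokens = ['rt', '@user', ':', 'python', 'is', 'great',
          'for', 'data', 'and', 'python', '!']

# Keep only the tokens that are not stop-words or punctuation
terms_stop = [term for term in tokens if term not in stop]
count_stop = Counter(terms_stop)
print(count_stop.most_common(2))
```

Note that the comparison is case-sensitive, so this only works as intended if the tokeniser lowercases regular words, as assumed here.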
More term filters
Besides stop-word removal, we can further customise the list of terms/tokens we are interested in. Here are some examples that you can embed in the first fragment of code:
    # Count terms only once, equivalent to Document Frequency
    terms_single = set(terms_all)
    # Count hashtags only
    terms_hash = [term for term in preprocess(tweet['text'])
                  if term.startswith('#')]
    # Count terms only (no hashtags, no mentions)
    terms_only = [term for term in preprocess(tweet['text'])
                  if term not in stop and
                  not term.startswith(('#', '@'))]
                  # mind the ((double brackets))
                  # startswith() takes a tuple (not a list)
                  # when we pass several prefixes
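To see how these filters behave side by side, here is a small sketch on made-up, already-tokenised tweets (the data and the counter names are assumptions for illustration). It contrasts raw term frequency with the set-based count, where a term is counted at most once per tweet.

```python
from collections import Counter

# Made-up tweets, already tokenised
tokenised_tweets = [
    ['#python', 'python', 'python', '@user'],
    ['#python', 'data', '#nlp'],
    ['python', 'data', 'data'],
]

term_freq = Counter()    # counts every occurrence of a term
doc_freq = Counter()     # counts each term at most once per tweet
hashtags = Counter()     # hashtags only
plain_terms = Counter()  # no hashtags, no mentions

for terms_all in tokenised_tweets:
    term_freq.update(terms_all)
    doc_freq.update(set(terms_all))
    hashtags.update(t for t in terms_all if t.startswith('#'))
    # startswith() accepts a tuple of prefixes
    plain_terms.update(t for t in terms_all
                       if not t.startswith(('#', '@')))

print(term_freq['python'], doc_freq['python'])
```

Here `python` occurs three times overall but in only two tweets, so the two counters disagree; which one is more useful depends on whether repeated use within a single tweet should carry extra weight.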
After counting and sorting, these are my most commonly used hashtags:
[('#python', 9), ('#scala', 6), ('#nosql', 4), ('#bigdata', 3), ('#nlp', 3)]
and these are my most commonly used terms:
[('python', 11), ('data', 8), ('summarisation', 6), ('twitter', 5), ('nice', 5)]
While the frequent terms other than nice represent a clear topic, more often than not simple term frequencies don’t give us a deep explanation of what the text is about. To put things in context, let’s consider sequences of two terms (a.k.a. bigrams).
    from nltk import bigrams

    terms_bigram = bigrams(terms_stop)
The bigrams() function from NLTK takes a list of tokens and produces a sequence of tuples built from adjacent tokens (in NLTK 3 the result is a generator, so wrap it in list() if you need an actual list). Notice that we could use terms_all to compute the bigrams, but we would probably end up with a lot of garbage. If we decide to analyse longer n-grams (sequences of n tokens), it could make sense to keep the stop-words, in case we want to capture phrases like “to be or not to be”.
So after counting and sorting the bigrams, this is the result:
[(('nice', 'article'), 4), (('extractive', 'summarisation'), 4), (('summarisation', 'sentence'), 3), (('short', 'paper'), 3), (('paper', 'extractive'), 2)]
So apparently I tweet about nice articles (I wouldn’t bother sharing the boring ones) and extractive summarisation (the topic of my PhD dissertation). This also sounds about right.
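The counting of bigrams can be sketched without NLTK as follows. The `ngrams_from()` helper is a hypothetical name introduced here; for n=2 it produces the same kind of adjacent-token tuples described above, using only `zip`. The sample tweets are made up for illustration.

```python
from collections import Counter


def ngrams_from(tokens, n=2):
    """Hypothetical helper: list of n-token tuples built from adjacent tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Made-up tweets, already tokenised and with stop-words removed
filtered_tweets = [
    ['nice', 'article', 'extractive', 'summarisation'],
    ['nice', 'article', 'python'],
]

count_bigrams = Counter()
for terms_stop in filtered_tweets:
    # Build the bigrams tweet by tweet, so a pair never spans two tweets
    count_bigrams.update(ngrams_from(terms_stop, 2))

print(count_bigrams.most_common(1))  # → [(('nice', 'article'), 2)]

# Longer n-grams, this time keeping the stop-words
phrase = ['to', 'be', 'or', 'not', 'to', 'be']
trigrams = ngrams_from(phrase, 3)
print(trigrams[0])  # → ('to', 'be', 'or')
```

One thing to keep in mind: because the bigrams are computed after stop-word removal, two terms can end up adjacent even if a stop-word separated them in the original tweet.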
This article has built on the previous ones to discuss the basics of extracting interesting terms from a data set of tweets, using simple term frequencies, stop-word removal and n-grams. While these approaches are extremely simple to implement, they are quite useful for getting a bird’s eye view of the data. We have used some components of NLTK (introduced in a previous article), so we don’t have to re-invent the wheel.