Last Saturday was the closing day of the Six Nations Championship, an annual international rugby competition. Before turning on the TV to watch Italy being thrashed by Wales, I decided to use this event to collect some data from Twitter and perform some exploratory text analysis on something more interesting than my own small list of tweets.
This article continues the tutorial on Twitter Data Mining, re-using what we discussed in the previous articles with some more realistic data. It also expands the analysis by introducing the concept of term co-occurrence.
Tutorial Table of Contents:
- Part 1: Collecting data
- Part 2: Text Pre-processing
- Part 3: Term Frequencies
- Part 4: Rugby and Term Co-Occurrences (this article)
- Part 5: Data Visualisation Basics
- Part 6: Sentiment Analysis Basics
- Part 7: Geolocation and Interactive Maps
The Application Domain
As the name suggests, six teams are involved in the competition: England, Ireland, Wales, Scotland, France and Italy. This means that we can expect the event to be tweeted in multiple languages (English, French, Italian, Welsh, Gaelic, possibly others as well), with English being the main language. Assuming the team names will be mentioned frequently, we could decide to look also for their nicknames, e.g. Les Bleus for France or Azzurri for Italy. During the last day of the competition, three matches are played sequentially. Three teams in particular had a shot at the title: England, Ireland and Wales. In the end, Ireland won the competition, but everything was open until the very last minute.
Setting Up
I used the streaming API to download all the tweets containing the string #rbs6nations during the day. Obviously not all the tweets about the event contained the hashtag, but this is a good baseline. The time frame for the download was from around 12:15PM to 7:15PM GMT, that is from about 15 minutes before the first match to about 15 minutes after the last match was over. In the end, more than 18,000 tweets were downloaded in JSON format, making for about 75 MB of data. This should be small enough to quickly do some processing in memory, and at the same time big enough to observe something possibly interesting.
The textual content of the tweets has been pre-processed with tokenisation and lowercasing using the preprocess() function introduced in Part 2 of the tutorial.
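For reference, a minimal sketch of such a function could look like the following; the actual regex-based tokeniser from Part 2 is richer (it also keeps emoticons as single tokens), so treat this as a simplified stand-in:

```python
import re

# Simplified sketch of the preprocess() function from Part 2:
# a regex-based tokeniser plus lowercasing. The regex keeps
# mentions, hashtags and URLs as single tokens.
token_re = re.compile(r"(?:@[\w_]+)|(?:\#+[\w_]+)|(?:https?://\S+)|(?:[\w'-]+)")

def preprocess(text, lowercase=True):
    tokens = token_re.findall(text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return tokens

print(preprocess("Ireland beat Scotland 40-10 #RBS6Nations"))
```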
Interesting terms and hashtags
Following what we discussed in Part 3 (Term Frequencies), we want to observe the most common terms and hashtags used during the day. If you have followed the discussion about creating different lists of tokens in order to capture terms without hashtags, hashtags only, removing stop-words, etc., you can play around with the different lists.
This is the unsurprising list of top 10 most frequent terms (terms_only in Part 3) in the data set.
[('ireland', 3163), ('england', 2584), ('wales', 2271), ('…', 2068), ('day', 1479), ('france', 1380), ('win', 1338), ('rugby', 1253), ('points', 1221), ('title', 1180)]
The first three terms correspond to the teams who had a shot at the title. The frequencies also respect the order in the final table. The fourth term is instead a punctuation mark that we missed and didn’t include in the list of stop-words. This is because string.punctuation only contains ASCII symbols, while here we’re dealing with a unicode character. If we dig into the data, we will find more examples like this, but for the moment we don’t worry about it.
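Concretely, the fix is a one-line change to the stop-word list; a sketch below (in the tutorial the NLTK English stop-words are also part of this list, and note that the ellipsis is the single unicode character U+2026, not three ASCII dots):

```python
from string import punctuation

# string.punctuation only covers ASCII symbols, so unicode punctuation
# such as the horizontal ellipsis U+2026 must be added explicitly.
# (In the tutorial, the NLTK English stop-words are also in this list.)
stop = list(punctuation) + ['rt', 'via', '\u2026']

print('\u2026' in stop)  # the suspension-points character is now filtered
```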
After adding the suspension-points symbol to the list of stop-words, we have a new entry at the end of the list:
[('ireland', 3163), ('england', 2584), ('wales', 2271), ('day', 1479), ('france', 1380), ('win', 1338), ('rugby', 1253), ('points', 1221), ('title', 1180), ('☘', 1154)]
Interestingly, it’s a new token we didn’t account for: an emoji (in this case, the Irish shamrock).
If we have a look at the most common hashtags, we need to consider that #rbs6nations will be by far the most common token (that’s our search term for downloading the tweets), so we can exclude it from the list. This leaves us with:
[('#engvfra', 1701), ('#itavwal', 927), ('#rugby', 880), ('#scovire', 692), ('#ireland', 686), ('#angfra', 554), ('#xvdefrance', 508), ('#crunch', 500), ('#wales', 446), ('#england', 406)]
We can observe that the most common hashtags, apart from #rugby, are related to the individual matches. In particular, England v France received the highest number of mentions, probably because it was the last match of the day, with a dramatic finale. Something interesting to notice is that a fair amount of tweets also contained terms in French: the count for #angfra should in fact be added to #engvfra. Those unfamiliar with rugby probably wouldn’t recognise that #crunch should also be counted towards the England v France match, as Le Crunch is the traditional name for this fixture. So by far, the last match received the most attention.
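The hashtag counting with the query term excluded can be sketched as follows (toy tokenised tweets stand in for the real data set):

```python
from collections import Counter

# Count hashtags, skipping the query hashtag used to collect the data
count_hash = Counter()
tweets = [['#rbs6nations', '#engvfra'], ['#rbs6nations', '#rugby']]  # toy data
for tokens in tweets:
    terms_hash = [term for term in tokens
                  if term.startswith('#') and term != '#rbs6nations']
    count_hash.update(terms_hash)
print(count_hash.most_common(10))
```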
Term co-occurrences
Sometimes we are interested in the terms that occur together. This is mainly because the context gives us a better insight about the meaning of a term, supporting applications such as word disambiguation or semantic similarity. We discussed the option of using bigrams in the previous article, but we want to extend the context of a term to the whole tweet.
We can refactor the code from the previous article in order to capture the co-occurrences. We build a co-occurrence matrix com such that com[x][y] contains the number of times the term x has been seen in the same tweet as the term y:
import json
from collections import defaultdict

com = defaultdict(lambda: defaultdict(int))

# f is the file pointer to the JSON data set
for line in f:
    tweet = json.loads(line)
    terms_only = [term for term in preprocess(tweet['text'])
                  if term not in stop and not term.startswith(('#', '@'))]
    # Build co-occurrence matrix
    for i in range(len(terms_only) - 1):
        for j in range(i + 1, len(terms_only)):
            w1, w2 = sorted([terms_only[i], terms_only[j]])
            if w1 != w2:
                com[w1][w2] += 1
While building the co-occurrence matrix, we don’t want to count the same term pair twice, i.e. com[A][B] == com[B][A], so the inner for loop starts from i+1 in order to build a triangular matrix, while sorted ensures the pair is stored in alphabetical order.
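Since only the alphabetically sorted pair is stored, looking up the count for an arbitrary pair requires the same normalisation. A small helper (hypothetical, not part of the original code) makes this explicit:

```python
from collections import defaultdict

com = defaultdict(lambda: defaultdict(int))
com['england']['france'] = 3  # stored once, with the terms in sorted order

def co_occurrence(com, t1, t2):
    """Return the count for a pair regardless of the order it is asked in."""
    w1, w2 = sorted([t1, t2])
    return com[w1][w2]

print(co_occurrence(com, 'france', 'england'))  # → 3
```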
For each term, we then extract the 5 most frequent co-occurrent terms, creating a list of tuples in the form ((term1, term2), count):
import operator

com_max = []
# For each term, look for the most common co-occurrent terms
for t1 in com:
    t1_max_terms = sorted(com[t1].items(), key=operator.itemgetter(1), reverse=True)[:5]
    for t2, t2_count in t1_max_terms:
        com_max.append(((t1, t2), t2_count))
# Get the most frequent co-occurrences
terms_max = sorted(com_max, key=operator.itemgetter(1), reverse=True)
print(terms_max[:5])
The results:
[(('6', 'nations'), 845), (('champions', 'ireland'), 760), (('nations', 'rbs'), 742), (('day', 'ireland'), 731), (('ireland', 'wales'), 674)]
This implementation is pretty straightforward, but depending on the data set and on the use of the matrix, one might want to look into tools like scipy.sparse for building a sparse matrix.
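As a sketch of that idea, terms can be mapped to integer indices and the counts accumulated in a sparse matrix (assuming scipy is installed; the dok_matrix format is convenient for incremental updates):

```python
from scipy.sparse import dok_matrix

# Toy vocabulary and tweets, just to illustrate the data structure
vocabulary = {'ireland': 0, 'england': 1, 'wales': 2}
tweets = [['ireland', 'england'], ['ireland', 'wales'], ['ireland', 'england']]

n = len(vocabulary)
com = dok_matrix((n, n), dtype=int)
for tokens in tweets:
    for i in range(len(tokens) - 1):
        for j in range(i + 1, len(tokens)):
            w1, w2 = sorted([tokens[i], tokens[j]])
            com[vocabulary[w1], vocabulary[w2]] += 1

print(com[vocabulary['england'], vocabulary['ireland']])
```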
We could also look for a specific term and extract its most frequent co-occurrences. We simply need to modify the main loop by including an extra counter, for example:
import sys
import json
from collections import Counter

search_word = sys.argv[1]  # pass a term as a command-line argument
count_search = Counter()
for line in f:
    tweet = json.loads(line)
    terms_only = [term for term in preprocess(tweet['text'])
                  if term not in stop and not term.startswith(('#', '@'))]
    if search_word in terms_only:
        count_search.update(terms_only)
print("Co-occurrence for %s:" % search_word)
print(count_search.most_common(20))
The outcome for “ireland”:
[('champions', 756), ('day', 727), ('nations', 659), ('wales', 654), ('2015', 638), ('6', 613), ('rbs', 585), ('http://t.co/y0nvsvayln', 559), ('☘', 526), ('10', 522), ('win', 377), ('england', 377), ('twickenham', 361), ('40', 360), ('points', 356), ('sco', 355), ('ire', 355), ('title', 346), ('scotland', 301), ('turn', 295)]
The outcome for “rugby”:
[('day', 476), ('game', 160), ('ireland', 143), ('england', 132), ('great', 105), ('today', 104), ('best', 97), ('well', 90), ('ever', 89), ('incredible', 87), ('amazing', 84), ('done', 82), ('amp', 71), ('games', 66), ('points', 64), ('monumental', 58), ('strap', 56), ('world', 55), ('team', 55), ('http://t.co/bhmeorr19i', 53)]
Overall, quite interesting.
Summary
This article has discussed a toy example of text mining on Twitter, using some realistic data collected during a sporting event. Using what we have learnt in the previous episodes, we have downloaded some data with the streaming API, pre-processed the data in JSON format and extracted some interesting terms and hashtags from the tweets. The article has also introduced the concept of term co-occurrence, shown how to build a co-occurrence matrix and discussed how to use it to surface some interesting insights.
Happy New Year! Ripping blog – keep it up!
For the newer programmer following along, don’t forget to import defaultdict from collections:
from collections import defaultdict
Thanks Dave, I’ve updated the post, good catch
Cheers,
Marco
Hi Marco,
How can I download tweets within a specific time window, like you downloaded between 12:15PM and 7:15PM GMT?
Also, what can I do to download tweets from a time that has already passed, e.g. if I want to download yesterday’s tweets?
Hello Marco, for some reason I’m getting this error:
Traceback (most recent call last):
File “C:\Users\Afonso\Documents\Workspaces\Workspace JAVA\Mineracao\Projeto.py”, line 207, in
tweet = json.loads(line)
File “C:\Users\Afonso\AppData\Local\Programs\Python\Python35\lib\json\__init__.py”, line 319, in loads
return _default_decoder.decode(s)
File “C:\Users\Afonso\AppData\Local\Programs\Python\Python35\lib\json\decoder.py”, line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File “C:\Users\Afonso\AppData\Local\Programs\Python\Python35\lib\json\decoder.py”, line 357, in raw_decode
raise JSONDecodeError(“Expecting value”, s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
My archive is working well with all the other code, but in this part of the article it complains about json.loads(), please help!
Hello Marco!
Can you explain what this line of code does?
t1_max_terms = max(com[t1].items(), key=operator.itemgetter(1))[:5]
Doesn’t max() return a single maximum value? Why have you used [0:5] at the end? Are you actually sorting the dict using itemgetter(1)? All in all, what exactly is t1_max_terms?
Thanks!
Hi, you are correct: max() returns a single value. In fact, that was a copy&paste mistake of mine from previous experiments, thanks for spotting it. The correct function to use is sorted(), so t1_max_terms is the list of (five) terms with the highest co-occurrence frequency for the term t1. I’ve updated the snippet accordingly.
Best regards,
Marco
this line
print("Co-occurrence for %s:" % search_word)
is giving me a syntax error (invalid syntax) using Python 2.7 and I am unable to sort it out; any help is appreciated.
Hi,
WordPress from time to time decides to encode HTML entities without asking :)
I’ve fixed the line in the article, thanks for spotting it.
Side comment, the code is developed and tested for Python 3.4/3.5 so some incompatibilities with Python 2 might come up occasionally.
Cheers,
Marco
Thank you so much. Another question came up while analysing this code:
how can we remove emojis from the data?
The code you shared takes care of emoticons and such, but emoji are a problem; I am only doing hashtag searches because of this shortcoming,
but I am wasting a lot of data by neglecting the other terms.
Thanks so much for this great tutorial! I’m really new to Python and text mining. What do you mean by “pass a term as a command-line argument”? Where do I put the particular search term I want to find co-occurrences for in that last snippet? Thanks again.
Hi Jessica
when you run your script, you can pass additional command-line arguments, which are accessible in the sys.argv list (the first element, sys.argv[0], contains the name of the script; the second element, sys.argv[1], contains the first command-line argument, and so on). The command is:
python your_script.py term_to_search
so in that snippet the variable sys.argv[1] takes the value term_to_search
Cheers,
Marco
Hello Marco,
I’m having the problem that my results are showing as unicode strings and I don’t know how to fix it.
Example:
[((u'\ud83d', u'\ude02'), 5269), ((u'de', u'que'), 4210), ((u'de', u'la'), 4208), ((u'de', u'\xf3'), 3307), ((u'de', u'\xed'), 2825)]
Hi Hector,
the regular-expression-based tokeniser described in the earlier articles is fairly simple and designed mainly for English. You can check out some tools from the NLTK like word_tokenize and TweetTokenizer, and also have a look at this discussion about unicode encoding issues: https://github.com/nltk/nltk/issues/1155
Cheers
Marco
Hi Marco,
very nice tutorial and thanks for sharing it! I have the same problem as Hector with French text and I haven’t figured out how to solve it.
@Hector did you find a solution?
Salvatore
I have also met this problem when processing Chinese and I don’t know how to fix it.
Example:
[(u'\u5c0f\u660e', defaultdict(int, {u'\u6bd5\u4e1a': 1})),
(1, defaultdict(int, {})),
(u'\u6bd5\u4e1a', defaultdict(int, {u'\u4e8e': 1})),
(u'\u5728', defaultdict(int, {u'\u65e5\u672c\u4eac\u90fd\u5927\u5b66': 1})),
(u'\uff0c', defaultdict(int, {u'\u540e': 1})),
(u'\u4e8e', defaultdict(int, {u'\u4e2d\u56fd\u79d1\u5b66\u9662': 1})),
(u'\u65e5\u672c\u4eac\u90fd\u5927\u5b66',
defaultdict(int, {u'\u6df1\u9020': 1})),
(u'\u4e2d\u56fd\u79d1\u5b66\u9662',
defaultdict(int, {u'\u8ba1\u7b97\u6240': 1})),
(u'\u540e', defaultdict(int, {u'\u5728': 1})),
(u'\u8ba1\u7b97\u6240', defaultdict(int, {u'\uff0c': 1}))]
Hi libin,
for Chinese you’ll probably need a specialised library for tokenisation. For example check out https://github.com/fxsjy/jieba (I haven’t worked with Chinese so I can’t give you more details atm)
Cheers,
Marco
Hi, Marco,
Thank you for your kind reply.
I have figured it out through encode('utf-8') and I have got the co-occurrence bigrams; I am now focusing on how to build a visualised network.
Libin
Hey Marco, how do I add the suspension-points symbol (…) to the list of stop words? I have tried
this change
stop = stopwords.words('english') + punctuation + ['RT', 'via', '…']
but it’s not working. Thanks in advance.
Hi Marco,
I have been messing around with the above code to look for co-occurrences of three words.
(This gives a bit more insight: say you collected @RealDonaldTrump tweets, "regret voting for" or "make america great" can give insight into negative or positive connotations of these tweets.)
Can you give me any code snippets? or hints?
Thanks,
Sam
Hi Sam, did you have a look at trigrams (or bigrams, or n-grams)?
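As a sketch of that suggestion, trigrams can be generated with a simple zip-based helper and counted like any other token (toy tokenised tweets here in place of real data, echoing Sam's example):

```python
from collections import Counter

def ngrams(tokens, n):
    """Generate n-word sequences from a list of tokens."""
    return zip(*(tokens[i:] for i in range(n)))

# Toy tokenised tweets in place of the actual data
tweets = [['make', 'america', 'great', 'again'],
          ['make', 'america', 'great', 'again'],
          ['regret', 'voting', 'for', 'him']]

count_tri = Counter()
for tokens in tweets:
    count_tri.update(ngrams(tokens, 3))
print(count_tri.most_common(2))
```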
Cheers
Marco
Hi Marco,
Thanks for your great work.
About term co-occurrences: why don’t you use bigrams and collections from the previous post?
Iterating over each word combination seems to waste a lot of resources.
Best wishes,
Yang
Hi Yang, that’s also an option, but it would limit the co-occurrences to words that appear next to each other rather than anywhere in the same tweet (the context as a whole), which is what we want to capture here.
Cheers
Marco
Great tutorial. The code to calculate co-occurrences is too slow for large datasets; is there any way to speed up such calculations? Thanks!