Data Mining · NLP · Python

Mining Twitter Data with Python (Part 4: Rugby and Term Co-occurrences)

Last Saturday was the closing day of the Six Nations Championship, an annual international rugby competition. Before turning on the TV to watch Italy being thrashed by Wales, I decided to use this event to collect some data from Twitter and perform some exploratory text analysis on something more interesting than the small list of my tweets.

This article continues the tutorial on Twitter Data Mining, re-using what we discussed in the previous articles with some more realistic data. It also expands the analysis by introducing the concept of term co-occurrence.

The Application Domain

As the name suggests, six teams are involved in the competition: England, Ireland, Wales, Scotland, France and Italy. This means that we can expect the event to be tweeted in multiple languages (English, French, Italian, Welsh, Gaelic, possibly other languages as well), with English being the major language. Assuming the team names will be mentioned frequently, we could also look for their nicknames, e.g. Les Bleus for France or Azzurri for Italy. During the last day of the competition, three matches are played sequentially. Three teams in particular had a shot at the title: England, Ireland and Wales. In the end, Ireland won the competition but everything was open until the very last minute.

Setting Up

I used the streaming API to download all the tweets containing the string #rbs6nations during the day. Obviously not all the tweets about the event contained the hashtag, but this is a good baseline. The time frame for the download was from around 12:15PM to 7:15PM GMT, that is from about 15 minutes before the first match to about 15 minutes after the last match was over. In the end, more than 18,000 tweets were downloaded in JSON format, making for about 75 MB of data. This should be small enough to quickly do some processing in memory, and at the same time big enough to observe something possibly interesting.
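
For readers reproducing this step, here is a minimal sketch of reading such a file, assuming it holds one JSON-encoded tweet per line (the layout the later snippets rely on). The helper name iter_tweets and the sample tweets are hypothetical. The helper skips blank lines, which some streaming dumps can contain and which otherwise make json.loads fail with an "Expecting value" error.

```python
import io
import json

def iter_tweets(lines):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if not line:
            # An empty line is not valid JSON; skip it instead of crashing.
            continue
        yield json.loads(line)

# Tiny in-memory stand-in for the real file (made-up tweets).
sample = io.StringIO(
    '{"text": "Come on Ireland!"}\n'
    '\n'
    '{"text": "Great day of rugby"}\n'
)
tweets = list(iter_tweets(sample))
print(len(tweets))
```

With a real data set, the same generator could be fed an open file object instead of the StringIO stand-in.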

The textual content of the tweets has been pre-processed with tokenisation and lowercasing using the preprocess() function introduced in Part 2 of the tutorial.
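
If Part 2 is not at hand, the sketch below is a simplified, hypothetical stand-in for preprocess(). It is not the function from that article: the name preprocess_sketch and the regular expression are assumptions, kept only detailed enough to show what tokenisation and lowercasing produce for a tweet.

```python
import re

# Alternatives are tried left to right, so URLs, mentions and hashtags
# are matched before plain words.
TOKEN_RE = re.compile(r"""
    (?:https?://\S+)          # URLs
    | (?:[@#]\w+)             # @-mentions and hashtags
    | (?:\w+(?:['’]\w+)*)     # words, optionally with inner apostrophes
    | (?:\S)                  # any other non-space character, one at a time
""", re.VERBOSE)

def preprocess_sketch(text):
    """Split a tweet into tokens and lowercase every token."""
    return [token.lower() for token in TOKEN_RE.findall(text)]

print(preprocess_sketch("RT @user: Ireland win the #RBS6Nations! http://t.co/abc"))
```

Note that lowercasing every token also lowercases URLs, which is consistent with the lowercased t.co links that show up in the results later on.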

Interesting terms and hashtags

Following what we discussed in Part 3 (Term Frequencies), we want to observe the most common terms and hashtags used during the day. If you followed the discussion about creating different lists of tokens (terms without hashtags, hashtags only, stop-words removed, etc.), you can play around with the different lists.

This is the unsurprising list of top 10 most frequent terms (terms_only in Part 3) in the data set.

[('ireland', 3163), ('england', 2584), ('wales', 2271), ('…', 2068), ('day', 1479), ('france', 1380), ('win', 1338), ('rugby', 1253), ('points', 1221), ('title', 1180)]

The first three terms correspond to the teams who had a go for the title. The frequencies also respect the order in the final table. The fourth term is instead a punctuation mark that we missed and didn’t include in the list of stop-words. This is because string.punctuation only contains ASCII symbols, while here we’re dealing with a unicode character. If we dig into the data, there will be more examples like this, but for the moment we don’t worry about it.
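
A minimal sketch of the fix, assuming the stop list is built as a plain Python list as in Part 3. The short base_stop list is a hypothetical stand-in for the real stop-word list, used here so the example runs on its own.

```python
import string

# string.punctuation holds only the 32 ASCII punctuation characters.
print('…' in string.punctuation)  # the Unicode ellipsis is U+2026

# Hypothetical, shortened stop-word list standing in for the one from Part 3.
base_stop = ['the', 'a', 'of', 'rt', 'via']

# Convert the punctuation string to a list before concatenating:
# adding a str to a list raises a TypeError.
stop = base_stop + list(string.punctuation) + ['…']

tokens = ['ireland', 'win', 'the', 'title', '…', '!']
terms = [t for t in tokens if t not in stop]
print(terms)
```

The ellipsis has to be the single character U+2026 rather than three ASCII dots, since the tokens are compared as whole strings.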

After adding the suspension-points symbol to the list of stop-words, we have a new entry at the end of the list:

[('ireland', 3163), ('england', 2584), ('wales', 2271), ('day', 1479), ('france', 1380), ('win', 1338), ('rugby', 1253), ('points', 1221), ('title', 1180), ('🍀', 1154)]

Interestingly, the new entry is a token we didn’t account for: an emoji symbol (in this case, the Irish shamrock).
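
One possible way to catch such tokens without listing them one by one is to look at Unicode character categories. The sketch below is an assumption on my side rather than part of the original pipeline; the helper name is_symbol_token is hypothetical. It drops tokens made up entirely of punctuation (categories starting with P) or symbols (categories starting with S).

```python
import unicodedata

def is_symbol_token(token):
    """True if every character is punctuation (P*) or a symbol (S*)."""
    return bool(token) and all(
        unicodedata.category(ch)[0] in ('P', 'S') for ch in token
    )

tokens = ['ireland', '🍀', '…', '#rugby', 'win', ':)']
terms = [t for t in tokens if not is_symbol_token(t)]
print(terms)
```

Hashtags survive because only their first character is punctuation, while emoticons such as :) are removed together with the emoji. Emoji sequences that include joiners or variation selectors fall outside these two categories and would need extra handling.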

If we have a look at the most common hashtags, we need to consider that #rbs6nations will be by far the most common token (that’s our search term for downloading the tweets), so we can exclude it from the list. This leaves us with:

[('#engvfra', 1701), ('#itavwal', 927), ('#rugby', 880), ('#scovire', 692), ('#ireland', 686), ('#angfra', 554), ('#xvdefrance', 508), ('#crunch', 500), ('#wales', 446), ('#england', 406)]

We can observe that the most common hashtags, apart from #rugby, are related to the individual matches. In particular, England v France received the highest number of mentions, probably because it was the last match of the day and had a dramatic finale. It is also worth noticing that a fair number of tweets contained terms in French: the count for #angfra should in fact be added to #engvfra. Those unfamiliar with rugby probably wouldn’t recognise that #crunch should also be counted with the #EngvFra match, as Le Crunch is the traditional name for this fixture. Overall, the last match received by far the most attention.
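
The merge described above can be sketched with a small alias table. The counts are copied from the top-10 list; the aliases dictionary is a hypothetical mapping written by hand for this example.

```python
from collections import Counter

# Counts copied from the top-10 hashtag list above.
hashtag_counts = Counter({
    '#engvfra': 1701, '#itavwal': 927, '#rugby': 880, '#scovire': 692,
    '#ireland': 686, '#angfra': 554, '#xvdefrance': 508, '#crunch': 500,
    '#wales': 446, '#england': 406,
})

# Hypothetical alias table: variant hashtag -> canonical hashtag.
aliases = {'#angfra': '#engvfra', '#crunch': '#engvfra'}

merged = Counter()
for tag, count in hashtag_counts.items():
    merged[aliases.get(tag, tag)] += count

print(merged.most_common(3))
```

This adds up hashtag mentions, not tweets: a tweet carrying both #engvfra and #crunch contributes twice to the merged total, so the result is an upper bound on the number of tweets about the match.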

Term co-occurrences

Sometimes we are interested in the terms that occur together. This is mainly because the context gives us a better insight about the meaning of a term, supporting applications such as word disambiguation or semantic similarity. We discussed the option of using bigrams in the previous article, but we want to extend the context of a term to the whole tweet.
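
To make the difference concrete, the sketch below compares the two notions of context on a made-up, already pre-processed tweet. Bigrams pair only adjacent tokens, while the whole-tweet context pairs every token with every other token.

```python
from itertools import combinations

tokens = ['ireland', 'crowned', 'champions']  # made-up example tweet

# Bigrams: only adjacent tokens are paired.
bigrams = list(zip(tokens, tokens[1:]))

# Whole-tweet context: every unordered pair of tokens is paired.
pairs = list(combinations(tokens, 2))

print(bigrams)
print(pairs)
```

The pair ('ireland', 'champions') only shows up in the second list. The price is size: a tweet with n tokens yields n - 1 bigrams but n(n - 1)/2 pairs.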

We can refactor the code from the previous article in order to capture the co-occurrences. We build a co-occurrence matrix com such that com[x][y] contains the number of times the term x has been seen in the same tweet as the term y:

import json
from collections import defaultdict
# remember to include the other imports and definitions from the previous posts:
# preprocess() from Part 2 and the stop list from Part 3

com = defaultdict(lambda : defaultdict(int))

# f is the file pointer to the JSON data set
for line in f: 
    tweet = json.loads(line)
    terms_only = [term for term in preprocess(tweet['text']) 
                  if term not in stop 
                  and not term.startswith(('#', '@'))]

    # Build co-occurrence matrix
    for i in range(len(terms_only)-1):            
        for j in range(i+1, len(terms_only)):
            w1, w2 = sorted([terms_only[i], terms_only[j]])                
            if w1 != w2:
                com[w1][w2] += 1

While building the co-occurrence matrix, we don’t want to count the same term pair twice, since com[A][B] and com[B][A] would hold the same value. The inner for loop therefore starts from i+1 in order to build a triangular matrix, while sorted() stores each pair in alphabetical order, regardless of the order in which the two terms appear in the tweet.
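
A small worked example may help to see the triangular layout. The three tweets are made up and already filtered, and the lookup helper cooccurrence is a hypothetical addition: since only the alphabetically sorted order is stored, a lookup has to sort its arguments too.

```python
from collections import defaultdict

# Three made-up tweets, already tokenised and filtered.
tweets = [
    ['ireland', 'champions', 'win'],
    ['ireland', 'win'],
    ['wales', 'ireland'],
]

com = defaultdict(lambda: defaultdict(int))
for terms in tweets:
    for i in range(len(terms) - 1):
        for j in range(i + 1, len(terms)):
            w1, w2 = sorted([terms[i], terms[j]])
            if w1 != w2:
                com[w1][w2] += 1

def cooccurrence(com, a, b):
    """Look up a pair in either order; only the sorted order is stored."""
    w1, w2 = sorted([a, b])
    # .get() avoids inserting empty rows into the defaultdict on lookup.
    return com.get(w1, {}).get(w2, 0)

print(cooccurrence(com, 'win', 'ireland'))
print(cooccurrence(com, 'ireland', 'wales'))
```

Here 'win' never appears as a row of the matrix, because it sorts after every term it co-occurs with; its counts live in the rows for 'champions' and 'ireland'.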

For each term, we then extract the 5 most frequent co-occurring terms, creating a list of tuples in the form ((term1, term2), count):

import operator

com_max = []
# For each term, look for the most common co-occurrent terms
for t1 in com:
    t1_max_terms = sorted(com[t1].items(), key=operator.itemgetter(1), reverse=True)[:5]
    for t2, t2_count in t1_max_terms:
        com_max.append(((t1, t2), t2_count))
# Get the most frequent co-occurrences
terms_max = sorted(com_max, key=operator.itemgetter(1), reverse=True)
print(terms_max[:5])

The results:

[(('6', 'nations'), 845), (('champions', 'ireland'), 760), (('nations', 'rbs'), 742), (('day', 'ireland'), 731), (('ireland', 'wales'), 674)]

This implementation is pretty straightforward, but depending on the data set and on the use of the matrix, one might want to look into tools like scipy.sparse for building a sparse matrix.
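
Short of a sparse matrix, a lighter option that stays within the standard library is a flat Counter keyed by sorted term pairs. This is a swapped-in alternative to the nested defaultdict, not scipy.sparse, and it differs in one detail that is called out in the comments. The tweets are made up.

```python
from collections import Counter
from itertools import combinations

# Three made-up tweets, already tokenised and filtered.
tweets = [
    ['ireland', 'champions', 'win'],
    ['ireland', 'win'],
    ['wales', 'ireland'],
]

pair_counts = Counter()
for terms in tweets:
    # set() drops repeated terms, so each pair is counted once per tweet
    # (the nested-loop version above counts repeated terms more than once).
    # Sorting first means every pair comes out in alphabetical order.
    for pair in combinations(sorted(set(terms)), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))
```

With this layout the global ranking is a single most_common() call, at the cost of losing the per-term rows that the nested dictionary provides.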

We could also look for a specific term and extract its most frequent co-occurrences. We simply need to modify the main loop including an extra counter, for example:

import sys
from collections import Counter

search_word = sys.argv[1] # pass a term as a command-line argument
count_search = Counter()
for line in f:
    tweet = json.loads(line)
    terms_only = [term for term in preprocess(tweet['text']) 
                  if term not in stop 
                  and not term.startswith(('#', '@'))]
    if search_word in terms_only:
        # count the other terms only, so the search word itself is left out
        count_search.update(t for t in terms_only if t != search_word)
print("Co-occurrence for %s:" % search_word)
print(count_search.most_common(20))

The outcome for “ireland”:

[('champions', 756), ('day', 727), ('nations', 659), ('wales', 654), ('2015', 638), ('6', 613), ('rbs', 585), ('http://t.co/y0nvsvayln', 559), ('🍀', 526), ('10', 522), ('win', 377), ('england', 377), ('twickenham', 361), ('40', 360), ('points', 356), ('sco', 355), ('ire', 355), ('title', 346), ('scotland', 301), ('turn', 295)]

The outcome for “rugby”:

[('day', 476), ('game', 160), ('ireland', 143), ('england', 132), ('great', 105), ('today', 104), ('best', 97), ('well', 90), ('ever', 89), ('incredible', 87), ('amazing', 84), ('done', 82), ('amp', 71), ('games', 66), ('points', 64), ('monumental', 58), ('strap', 56), ('world', 55), ('team', 55), ('http://t.co/bhmeorr19i', 53)]

Overall, quite interesting.

Summary

This article has discussed a toy example of Text Mining on Twitter, using some realistic data taken during a sporting event. Using what we have learnt in the previous episodes, we have downloaded some data using the streaming API, pre-processed the data in JSON format and extracted some interesting terms and hashtags from the tweets. The article has also introduced the concept of term co-occurrence, shown how to build a co-occurrence matrix and discussed how to use it to find some interesting insights.

@MarcoBonzanini

22 thoughts on “Mining Twitter Data with Python (Part 4: Rugby and Term Co-occurrences)”

  1. Hello, Marcos, for some reason I’m getting this error:

    Traceback (most recent call last):
      File "C:\Users\Afonso\Documents\Workspaces\Workspace JAVA\Mineracao\Projeto.py", line 207, in <module>
        tweet = json.loads(line)
      File "C:\Users\Afonso\AppData\Local\Programs\Python\Python35\lib\json\__init__.py", line 319, in loads
        return _default_decoder.decode(s)
      File "C:\Users\Afonso\AppData\Local\Programs\Python\Python35\lib\json\decoder.py", line 339, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "C:\Users\Afonso\AppData\Local\Programs\Python\Python35\lib\json\decoder.py", line 357, in raw_decode
        raise JSONDecodeError("Expecting value", s, err.value) from None
    json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

    My file works fine with all the other code, but in this part of the article it complains about json.loads(line), please help!

  2. Hello Marco!

    Can you explain what this line of code does?

    t1_max_terms = max(com[t1].items(), key=operator.itemgetter(1))[:5]

    Doesn’t max() return a single maximum value? Why have you used [:5] at the end? Are you actually sorting the dict using itemgetter(1)? All in all, what exactly is t1_max_terms?

    Thanks!

    1. Hi, you are correct, max() returns a single value. In fact, that was some copy&paste mistake of mine from previous experiments, thanks for spotting it. The correct function to use is sorted(), so t1_max_terms is the list of (five) terms with the highest co-occurrence frequency for the term t1. I’ve updated the snippet accordingly
      Best regards,
      Marco

  3. this line
    print("Co-occurrence for %s:" % search_word)
    is giving me a syntax error(invalid syntax) using python 2.7 and i am unable to sort it out , any help is appreciated

    1. Hi,
      WordPress from time to time decides to encode HTML entities without asking :)
      I’ve fixed the line in the article, thanks for spotting it:

      print("Co-occurrence for %s:" % search_word)
      

      Side comment, the code is developed and tested for Python 3.4/3.5 so some incompatibilities with Python 2 might come up occasionally.
      Cheers,
      Marco

      1. thank you so much, another question came up while analyzing these codes,
        how can we remove emojis from the data?
        the code you shared is taking care of emoticons and all but emoji is a problem , i am only doing hash search because of this shortcoming.
        but i am wasting a lot of data power by neglecting the other terms.

  4. Thanks so much for this great tutorial! I’m really really new to Python and text mining. What do you mean by “pass a term as a command line argument”? Where do I put the particular search term I want to find co-occurrence in that last code? Thanks again.

    1. Hi Jessica
      when you run your script, you can pass additional command line arguments which are accessible in the sys.argv array (the first element of the array, sys.argv[0], contains the name of the script, the second element sys.argv[1] contains the first command line argument, etc.). The command is:
      python your_script.py term_to_search
      so in that snippet the variable sys.argv[1] takes the value term_to_search

      Cheers,
      Marco

  5. Hello Marco,

    I’m having the problem that my results are showing as unicode strings and I don’t know how to fix it.

    Example:

    [((u'\ud83d', u'\ude02'), 5269), ((u'de', u'que'), 4210), ((u'de', u'la'), 4208), ((u'de', u'\xf3'), 3307), ((u'de', u'\xed'), 2825)]

    1. Hi Hector,
      the tokenizer based on regular expressions described in the earlier articles is fairly simple and designed mainly for English. You can check out some tools from the NLTK like word_tokenize and TweetTokenizer, also have a look at this discussion about unicode encoding issues: https://github.com/nltk/nltk/issues/1155

      Cheers
      Marco

    2. I have also met this problem in processing Chinese and I don’t know how to fix it.

      Example:

      [(u'\u5c0f\u660e', defaultdict(int, {u'\u6bd5\u4e1a': 1})),
      (1, defaultdict(int, {})),
      (u'\u6bd5\u4e1a', defaultdict(int, {u'\u4e8e': 1})),
      (u'\u5728', defaultdict(int, {u'\u65e5\u672c\u4eac\u90fd\u5927\u5b66': 1})),
      (u'\uff0c', defaultdict(int, {u'\u540e': 1})),
      (u'\u4e8e', defaultdict(int, {u'\u4e2d\u56fd\u79d1\u5b66\u9662': 1})),
      (u'\u65e5\u672c\u4eac\u90fd\u5927\u5b66',
      defaultdict(int, {u'\u6df1\u9020': 1})),
      (u'\u4e2d\u56fd\u79d1\u5b66\u9662',
      defaultdict(int, {u'\u8ba1\u7b97\u6240': 1})),
      (u'\u540e', defaultdict(int, {u'\u5728': 1})),
      (u'\u8ba1\u7b97\u6240', defaultdict(int, {u'\uff0c': 1}))]

      1. Hi, Marco,
        Thank you for your kind reply.
        I have figured it out through encode('utf-8') and I have got the co-occurrence bigrams. I am now focusing on how to build a visualized network.
        Libin

  6. Hey Marc, How to add suspension points symbol(…) to list of stop words. I have tried
    this change
    stop = stopwords.words('english') + punctuation + ['RT', 'via','…']

    but its not working. Thanks in advance.

  7. Hi Marco,
    I have been messing around with the above code to look for co_occurence with three words.

    (this gives a bit more insight…take if you collected @RealDonaldTrump tweets, “regret voting for” or “make america great” can give insight into a negative on positive connotations to these tweets)

    Can you give me any code snippets? or hints?

    Thanks,
    Sam

  8. Hi Marco,

    Thanks for your great work.

    About Term co-occurrences, why don’t you use bigrams and collections from previous post?
    Iterating for each word combination seems to waste lots of resources?

    Best wishes,
    Yang

    1. Hi Yang, that’s also an option but it would limit the co-occurrences to words that you see next to each other rather than in the same tweet (context) as a whole, which is what we want to capture here.

      Cheers
      Marco
