Data Mining · NLP · Python

Mining Twitter Data with Python (Part 3: Term Frequencies)

This is the third part in a series of articles about data mining on Twitter. After collecting data and pre-processing some text, we are ready for some basic analysis. In this article, we’ll discuss the analysis of term frequencies to extract meaningful terms from our tweets.


Counting Terms

Assuming we have collected a list of tweets (see Part 1 of the tutorial), the first exploratory analysis that we can perform is a simple word count. In this way, we can observe which terms are most commonly used in the data set. In this example, I’ll use the set of my tweets, so the most frequent words should correspond to the topics I discuss (not necessarily, but bear with me for a couple of paragraphs).

We can use a custom tokeniser to split the tweets into a list of terms. The following code uses the preprocess() function described in Part 2 of the tutorial, in order to capture Twitter-specific aspects of the text, such as #hashtags, @-mentions, emoticons and URLs. To keep track of the frequencies while we are processing the tweets, we can use collections.Counter, a dictionary subclass that maps each term to its count and offers some useful methods like most_common():

import json
from collections import Counter
# preprocess() is the tokenising function defined in Part 2 of the tutorial

fname = 'mytweets.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))

The above code will produce some unimpressive results:

[(':', 44), ('rt', 26), ('to', 26), ('and', 25), ('on', 22)]

As you can see, the most frequent words (or should I say, tokens), are not exactly meaningful.
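To see the counting step in isolation, here is a minimal sketch that runs without a tweet file. The sample tweets and the simple_tokenize() helper are made up for illustration; the helper is only a naive stand-in for the preprocess() function from Part 2, not a replacement for it.

```python
from collections import Counter

# Hypothetical sample data standing in for the 'text' field of real tweets
sample_tweets = [
    "RT @user: Python is great",
    "Learning Python and data mining",
    "data mining with Python",
]

def simple_tokenize(text):
    """Naive stand-in for preprocess(): lowercase and split on whitespace."""
    return text.lower().split()

count_all = Counter()
for text in sample_tweets:
    # update() adds one to the count of every token in the list
    count_all.update(simple_tokenize(text))

# most_common(n) returns the n (term, count) pairs with the highest counts
print(count_all.most_common(3))
```

Terms that were never seen simply have a count of zero, so count_all['scala'] does not raise a KeyError.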

Removing stop-words

In every language, some words are particularly common. While their use in the language is crucial, they don’t usually convey a particular meaning, especially if taken out of context. This is the case of articles, conjunctions, some adverbs, etc. which are commonly called stop-words. In the example above, we can see three common stop-words – to, and and on. Stop-word removal is one important step that should be considered during the pre-processing stages. One can build a custom list of stop-words, or use available lists (e.g. NLTK provides a simple list for English stop-words).

Given the nature of our data and our tokenisation, we should also be careful with all the punctuation marks and with terms like RT (used for re-tweets) and via (used to mention the original author of an article or a re-tweet), which are not in the default stop-word list.

from nltk.corpus import stopwords
import string

punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']

We can now substitute the variable terms_all in the first example with something like:

terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]

After counting, sorting the terms and printing the top 5, this is the result:

[('python', 11), ('@miguelmalvarez', 9), ('#python', 9), ('data', 8), ('@danielasfregola', 7)]

So apparently I mostly tweet about Python and data, and the users I re-tweet most often are @miguelmalvarez and @danielasfregola. It sounds about right.
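The filtering step can be sketched on its own as follows. The tiny stop list and the token list are made up for illustration (the article uses NLTK's stopwords.words('english')), and a set is used instead of a list because membership tests on a set are faster. Note that the comparison is case-sensitive: this sketch assumes the tokens are already lowercased, for instance by calling preprocess() with lowercase=True; otherwise tokens like 'RT' or 'The' would not match the lowercase stop-words.

```python
import string
from collections import Counter

# Tiny hand-picked stop list, only for illustration
basic_stop_words = {'to', 'and', 'on', 'the', 'a', 'is', 'with'}
stop_set = basic_stop_words | set(string.punctuation) | {'rt', 'via'}

# Hypothetical, already lowercased tokens from one tweet
tokens = ['rt', '@user', ':', 'python', 'and', 'data', 'on', 'twitter', 'python']

# Keep only the tokens that are not stop-words or punctuation
terms_stop = [term for term in tokens if term not in stop_set]

count_stop = Counter(terms_stop)
print(count_stop.most_common(3))
```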

More term filters

Besides stop-word removal, we can further customise the list of terms/tokens we are interested in. Here are some examples that you can embed in the first fragment of code:

# Count terms only once, equivalent to Document Frequency
terms_single = set(terms_all)
# Count hashtags only
terms_hash = [term for term in preprocess(tweet['text']) 
              if term.startswith('#')]
# Count terms only (no hashtags, no mentions)
terms_only = [term for term in preprocess(tweet['text']) 
              if term not in stop and 
              not term.startswith(('#', '@'))] 
              # mind the ((double brackets))
              # startswith() takes a tuple (not a list) if 
              # we pass a list of inputs
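The difference between counting every occurrence (term frequency) and counting each term once per tweet (document frequency) can be seen in this small sketch. The tokenised tweets are invented for illustration and are assumed to be already lowercased and filtered.

```python
from collections import Counter

# Hypothetical tokenised tweets (one list of terms per tweet)
tokenised_tweets = [
    ['python', 'python', 'data'],
    ['python', '#nlp'],
    ['data', '#nlp', '#nlp'],
]

term_freq = Counter()
doc_freq = Counter()
for terms_all in tokenised_tweets:
    term_freq.update(terms_all)       # every occurrence counts
    doc_freq.update(set(terms_all))   # each term counts once per tweet

# 'python' occurs three times overall, but only in two tweets
print(term_freq['python'], doc_freq['python'])
```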

After counting and sorting, these are my most commonly used hashtags:

[('#python', 9), ('#scala', 6), ('#nosql', 4), ('#bigdata', 3), ('#nlp', 3)]

and these are my most commonly used terms:

[('python', 11), ('data', 8), ('summarisation', 6), ('twitter', 5), ('nice', 5)]

“nice”?

While the other frequent terms represent a clear topic, more often than not simple term frequencies don’t give us a deep explanation of what the text is about. To put things in context, let’s consider sequences of two terms (a.k.a. bigrams).

from nltk import bigrams 

terms_bigram = bigrams(terms_stop)

The bigrams() function from NLTK will take a list of tokens and produce a sequence of tuples using adjacent tokens (in NLTK 3 this is a generator, so wrap it in list() if you need to iterate over it more than once). Notice that we could use terms_all to compute the bigrams, but we would probably end up with a lot of garbage. In case we decide to analyse longer n-grams (sequences of n tokens), it could make sense to keep the stop-words, just in case we want to capture phrases like “to be or not to be”.

So after counting and sorting the bigrams, this is the result:

[(('nice', 'article'), 4), (('extractive', 'summarisation'), 4), (('summarisation', 'sentence'), 3), (('short', 'paper'), 3), (('paper', 'extractive'), 2)]

So apparently I tweet about nice articles (I wouldn’t bother sharing the boring ones) and extractive summarisation (the topic of my PhD dissertation). This also sounds about right.
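As a rough illustration of what happens under the hood, the sketch below builds bigrams with plain zip() instead of NLTK, so it runs without extra dependencies. The adjacent_pairs() helper and the sample tweets are hypothetical; for a list of tokens the helper should give the same pairs as list(nltk.bigrams(tokens)). Bigrams are computed tweet by tweet, so no pair spans the boundary between two tweets.

```python
from collections import Counter

def adjacent_pairs(tokens):
    """Return the bigrams of a token list as a list of tuples."""
    return list(zip(tokens, tokens[1:]))

# Hypothetical tokenised tweets, stop-words already removed
tokenised_tweets = [
    ['nice', 'article', 'extractive', 'summarisation'],
    ['short', 'paper', 'extractive', 'summarisation'],
    ['nice', 'article', 'python'],
]

count_bigrams = Counter()
for terms_stop in tokenised_tweets:
    # Tuples are hashable, so they can be counted like single terms
    count_bigrams.update(adjacent_pairs(terms_stop))

print(count_bigrams.most_common(2))
```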

Summary

This article has built on top of the previous ones to discuss the basics of extracting interesting terms from a data set of tweets, by using simple term frequencies, stop-word removal and n-grams. While these approaches are extremely simple to implement, they are quite useful for getting a bird’s eye view of the data. We have used some components of NLTK (introduced in a previous article), so we don’t have to re-invent the wheel.

@MarcoBonzanini


37 thoughts on “Mining Twitter Data with Python (Part 3: Term Frequencies)”

  1. Hi. Thanks for this wonderful guide :-) and hello from Denmark.

    I have gone through installation of 32-bit Anaconda and I use iPython and I AM very noobish, since I get an error when I input this code:

    import operator
    import json
    from collections import Counter

    fname = '26-Oktober-data.json'
    with open(fname, 'r') as f:
        count_all = Counter()
        for line in f:
            tweet = json.loads(line)
            # Create a list with all the terms
            terms_all = [term for term in preprocess(tweet['text'])]
            # Update the counter
            count_all.update(terms_all)
        # Print the first 5 most frequent words
        print(count_all.most_common(5))

    TypeError: 'NoneType' object is not iterable

    I have got a json-file named 26-Oktober-data.json with about 14000 tweets that I would like to manipulate, but I keep getting this error. I actually DID manage to get it to work the first time I tried it (Python told me which tokens were used the most), but when I made a repetition an hour later, to learn it again, I got the above error.

    What am I missing here? :-)


    1. Hi Jeppe, apologies for the late reply. I’m trying to wrap up the comments on both articles. I’ve been through your latest example with vwdata.json and you’re using the preprocess() function twice. This is a sample code that wraps everything together: https://gist.github.com/bonzanini/3fdc080258fc53bcd3fa
      Assuming the data are in JSON Lines (one JSON doc per line, no empty lines), it works fine. Please see if this one solves the previous issues
      Cheers,
      Marco


  2. import json

    import string

    fname = open('26-Oktober-data.json')
    counts = dict()

    for line in fname:
        line = line.translate(None, string.punctuation)
        line = line.lower()
        words = line.split()
        for word in words:
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1

    lst = list()
    for key, val in counts.items():
        lst.append((val, key))

    lst.sort(reverse=True)
    for key, val in lst[:10]:
        print key, val

    ## this variant work like a charm


    1. Hi bfrost888, sorry for the late reply.
      The “track” parameter that you use with the streaming API is not case sensitive. What you can do is simply to process your tweets after you’ve downloaded them (without lowercasing/normalisation, so you keep the original casing)


  3. Hey, great post! But when I’m using this I get an error with my data:

    line 44, in <module>
    terms_all = [term for term in preprocess(tweet['text'])]
    KeyError: 'text'

    Any thoughts on what I am doing wrong?

    Best,


    1. Hi Pydwon,
      Just make sure the file you’re trying to process is in JSON Lines format (each line is a JSON document, no empty lines). Occasionally, for a variety of reasons, the response from the streaming API is not a tweet but an error message, which is still a json document, so the whole file is still in the valid JSON Lines format. In this case, you’d need to either filter out the non-tweet lines from the file, or to check whether the tweet dictionary has a text key (I’d go with a clean dataset)

      Cheers,
      Marco


      1. Hey Marco,

        first off, thank you for the quick reply!
        I think you might be right about the error response from the API. I collected a whole day of data, hence the 4 GB file size, which makes an error message very likely. What do you think would be the best and fastest method to clean my JSON up? Opening it in Excel or any text editor?

        Best,


      2. I’d briefly go with cat/head/grep e.g.
        head -100000 yourdata.jsonl | grep -v "text"
        just to see what the error messages look like. With a few lines of python you can then iterate over the file and discard the lines without “text” (assuming only genuine tweets have the key “text”)
        Cheers,
        Marco
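The "few lines of python" mentioned above could look like the following sketch. It assumes the data are in JSON Lines format and that only genuine tweets have a "text" key; the iter_tweets() helper and the sample lines are hypothetical, and the non-tweet document is only an example of what a streaming message might look like.

```python
import json

def iter_tweets(lines):
    """Yield only the documents that look like tweets (dicts with a 'text' key)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        try:
            doc = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if isinstance(doc, dict) and 'text' in doc:
            yield doc

# Hypothetical file content: two tweets, an empty line,
# a non-tweet message and a corrupted line
sample_lines = [
    '{"text": "hello #python"}',
    '',
    '{"limit": {"track": 5}}',
    'not json at all',
    '{"text": "second tweet"}',
]

clean = list(iter_tweets(sample_lines))
print(len(clean))
```

With a real file, the same function can be fed the open file object directly, e.g. `for tweet in iter_tweets(f):`, so the 4 GB file is read one line at a time rather than loaded into memory.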


  4. Oi, Marco.
    I had some problems with “do_something_else” but, reading the comments, I saw that I only need to define the function. So I decided to use this code, as you suggested:

    fname = 'python.json'
    with open(fname, 'r') as f:
        count_all = Counter()
        for line in f:
            tweet = json.loads(line)
            tokens = preprocess(tweet['text'])
            count_all.update(tokens)
        print(count_all.most_common(5))

    However, I’m having trouble. These are the errors:

    Traceback (most recent call last):
      File "C:\Users\USER\workspace\TesteApi\Teste.py", line 128, in <module>
        do_something_else(tokens)
      File "C:\Users\USER\workspace\TesteApi\Teste.py", line 119, in do_something_else
        tweet = json.loads(line)
      File "C:\Users\USER\AppData\Local\Programs\Python\Python35\lib\json\__init__.py", line 319, in loads
        return _default_decoder.decode(s)
      File "C:\Users\USER\AppData\Local\Programs\Python\Python35\lib\json\decoder.py", line 339, in decode
        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
      File "C:\Users\USER\AppData\Local\Programs\Python\Python35\lib\json\decoder.py", line 357, in raw_decode
        raise JSONDecodeError("Expecting value", s, err.value) from None
    json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

    Please, can you help me ?
    *Sorry about my English*


  5. Hi

    Thanks for the great example. I am getting the following error. Can you please help me with this?

    Traceback (most recent call last):
      File "F:\Studies\MS in CS\Project\init_anal_main.py", line 11, in <module>
        terms_all = [term for term in preprocess(tweet['text'])]
    NameError: name 'preprocess' is not defined


    1. Hi Bala,
      the error occurs because the function preprocess() is not defined in your code. You need to include the function definitions from part 2 of the tutorial (i.e. the block with preprocess(), tokenize() and related regular expressions)
      Cheers,
      Marco


  6. Hi!

    Great example, but I have a slight problem.
    The tokenizer works great with English text, but it starts breaking on non-ASCII (Unicode) characters.

    For example:

    Barça, que más veces ha jugado contra 10 en la historia https://t.co/7WUjZrMJah #UCL

    tokenizes to:

    Bar
    ç
    a
    ,
    que
    m
    á
    s
    veces
    ha
    jugado
    contra
    10
    en
    la
    historia
    https://t.co/7WUjZrMJah
    #UCL

    I need ‘Barça’ instead of ‘Bar’, ‘ç’, ‘a’ ….

    The default tokenizer takes care of this but breaks on hashtags, mentions and links, as you mentioned.

    Any ways I can get both to work?

    Help appreciated! :)


    1. Hi Krishanu,
      the TweetTokenizer in NLTK goes one step closer to what you need, but it still breaks on occasions, e.g.

      >>> from nltk.tokenize import TweetTokenizer
      >>> tweet = "Barça, que más veces ha jugado contra 10 en la historia https://t.co/7WUjZrMJah #UCL"
      >>> tokenizer = TweetTokenizer()
      >>> tokens = tokenizer.tokenize(tweet)
      >>> tokens
      ['Bar', 'ça', ',', 'que', 'más', 'veces', 'ha', 'jugado', 'contra', '10', 'en', 'la', 'historia', 'https://t.co/7WUjZrMJah', '#UCL']
      

      I’d take a look at their source to see if it can be extended/simplified easily.
      Cheers,
      Marco
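One possible direction, sketched here under the assumption of Python 3 (where \w matches Unicode letters in str patterns), is to describe words with [^\W\d_] (any Unicode letter) instead of [a-z]. The pattern below is a simplified, hypothetical tokeniser written for this example; it is not the regex from Part 2 and does not cover emoticons or HTML tags.

```python
import re

tweet = ("Barça, que más veces ha jugado contra 10 en la historia "
         "https://t.co/7WUjZrMJah #UCL")

# Alternatives are tried in order at each position, so URLs, mentions
# and hashtags are matched before plain words and punctuation
token_re = re.compile(r"""
    https?://\S+                      # URLs
  | [@#]\w+                           # @-mentions and hashtags
  | [^\W\d_]+(?:['\-][^\W\d_]+)*      # words, with optional inner ' or -
  | \d+(?:[.,]\d+)*                   # numbers
  | \S                                # anything else (e.g. punctuation)
""", re.VERBOSE)

tokens = token_re.findall(tweet)
print(tokens)
```

On this tweet the accented words stay whole ('Barça', 'más') while the URL and the hashtag are kept as single tokens. Other inputs may well need further alternatives.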


  7. Hi, I tried running your code but I get the error:

    Traceback (most recent call last):
      File "D:\WinPython-32bit-2.7.10.3\python-2.7.10\twitter_most_common_words.py", line 58, in <module>
        tokens = preprocess(tweet['text'])
    KeyError: 'text'

    I have no idea what went wrong though.

    import sys
    import json
    from collections import Counter
    import re
    from nltk.corpus import stopwords
    import string

    punctuation = list(string.punctuation)
    stop = stopwords.words('english') + punctuation + ['rt', 'via']

    emoticons_str = r"""
        (?:
            [:=;] # Eyes
            [oO\-]? # Nose (optional)
            [D\)\]\(\]/\\OpP] # Mouth
        )"""

    regex_str = [
        emoticons_str,
        r'<[^>]+>', # HTML tags
        r'(?:@[\w_]+)', # @-mentions
        r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
        r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs

        r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
        r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
        r'(?:[\w_]+)', # other words
        r'(?:\S)' # anything else
    ]

    tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
    emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

    def tokenize(s):
        return tokens_re.findall(s)

    def preprocess(s, lowercase=False):
        tokens = tokenize(s)
        if lowercase:
            tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
        return tokens

    with open('Data 20k.json', 'r') as f:
        count_all = Counter()
        for line in f:
            tweet = json.loads(line)
            j = json.loads(line)
            tokens = preprocess(tweet['text'])
            count_all.update(tokens)
        print(count_all.most_common(5))

    I used everything except the last part where I made some minor changes:

    with open('Data 20k.json', 'r') as f:
        count_all = Counter()
        for line in f:
            tweet = json.loads(line)
            j = json.loads(line)
            tokens = preprocess(tweet['text'])
            count_all.update(tokens)
        print(count_all.most_common(5))

    Please help me . Thank you


    1. Hi Jeremy, probably some of the lines in your data file are not valid tweets. Make sure you have exactly one tweet per line, with no empty lines, and if there are errors (e.g. network bumps from Twitter) you can remove them. In the meantime, a workaround could be:

      with open('Data 20k.json', 'r') as f:
          count_all = Counter()
          for line in f:
              tweet = json.loads(line)
              if tweet.get('text'):
                  tokens = preprocess(tweet['text'])
                  count_all.update(tokens)
          print(count_all.most_common(5))
      

      Cheers,
      Marco


      1. Hi Marco ,

        Thanks for the quick reply. I have checked my JSON file; I noticed there was a blank line in between and I have removed it. However, I tried running the program again and I get this error instead:

        ValueError: Extra Data: line 1 column 4488 – line 1 column 99678411 (char 4487 -99678410)

        Any idea?

        Thanks


  8. Just to add the error message :

    tweet = json.loads(line)
    return _default_decoder.decode(s)
    raise ValueError(errmsg("Extra data", s, end, len(s)))


  9. Hi, thanks so much for this, it’s most helpful. I do have a problem, though. I’m working with a large data file, 80 000 tweets or so, which is in an SQL database. I have a csv of the text of the tweets only, which I have managed to convert to Json and parse using the code on this page and the previous two. I am not a coder, but I can copy-paste and understand basic syntax.

    I am trying to extract hashtags, and your script only finds two hashtags in the whole corpus, both of them present, but this is by no means all of them. I am reasonably familiar with the tags in the archive already, since that is part of the rationale for harvesting them. In addition, the count for each tag (and for each term, when I just do terms) is 78289, which is the number of records in the archive.

    I’m at a loss, do you have any suggestions as to how I would go about debugging this?


    1. Okay, I’ve made some progress. It seems to be seeing only the final tweet in the list, and counting all of those terms 78289 times. I can view the Json file, and the tweets all seem to be there (I don’t have a file with 78289 copies of the same tweet, at least), and when I view the terms (print(tokens) from the code to tokenize) I can see more terms than the nine in the last tweet. Something about the counter isn’t working?


  10. Great, now I’m embarrassed. Problem solved. I was trying to be clever and reuse “tokens” from elsewhere in the code instead of pre-processing again. Sorry to have bothered you. Thanks again.


  11. So in my collection of tweets the token '\u2026' comes up a lot and it is apparently unicode for the ellipsis '…'. When I try to get rid of it by adding it to the list stop (stop = stopwords.words('english') + punctuation + ['rt', 'RT', 'via', '\u2026']) it still appears, any ideas how to fix this?


    1. I just read the next section where you talk specifically about the unicode characters. It’s easy enough for me just to ignore it.


    1. Hi Neil, you could look into n-grams (phrases of length n). At the end of the article I’ve introduced bigrams (n-grams of length 2) and the concept is very similar. NLTK also offers the nltk.ngrams() function, e.g.

      from nltk import ngrams
      text = "some long text here blah blah"
      tokens = text.split()  # ngrams() expects a sequence of tokens, not a raw string
      print(list(ngrams(tokens, 4))) # this will print the 4-grams from text
      

      Given the title you’re looking for, you split it on whitespaces so you know the “n” you need.
      Cheers,
      Marco


  12. Sorry to bug you again. Your code is really helpful and finally helping me learn the concepts! Yesterday I ran this code and it worked:

    import operator
    import json
    from collections import Counter
    fname = 'C:\Users\Public\Documents\Python Scripts\Clinton.json'
    with open(fname, 'r') as f:
        count_all = Counter()
        for line in f:
            tweet = json.loads(line)

            # Create a list with all the hashtags
            terms_hash = [term for term in preprocess(tweet['text'])
                          if term.startswith('#')]

            # Update the counter
            count_all.update(terms_hash)

        # Print the first 20 most frequent hashtags
        print(count_all.most_common(20))

    However, now I get an error: list indices must be integers, not str for the terms_hash line. No idea what happened or how to fix. :-(


    1. Hi
      check the format of your json file: each line must be a valid json document, no empty line, i.e. I suspect you have the error when the “tweet” variable is not a dictionary as expected. You could either clean the file manually or wrap your code in a try/except block to capture the error.
      Cheers,
      Marco


  13. Hi!

    I keep getting Unicode characters in my frequency analysis results, e.g.
    [(u'science', 15), (u'\u2026', 7), (u'2016', 2), (u'#biology', 2), (u'computer', 2), (u'water', 2), (u'12', 2), (u'1', 2), (u'gonna', 2), (u'code', 1)]

    How do I stop getting the \u2026?

    cheers


    1. Hi Benji, that particular unicode symbol is the ellipsis character, so I suggest you simply add it to the list of stop-words. In the next article (part 4) there’s a brief mention about it, also in the comments. From time to time you’ll have to refine your stop-word list anyway depending on your data.
      Cheers,
      Marco


      1. Hi Marco – I’ve tried to add it to the stop list but nothing works. I’ve tried adding it as '\u2026' and as '…' but it still shows up in the output.
        Is there something I’m missing?

        Currently it is:

        stop = stopwords.words('english') + punctuation + ['rt', 'via', '\u2026']

        cheers


      2. Hi Benji, sorry for the late reply. Try with u'\u2026' so you have the correct unicode string rather than a str (I’m assuming you’re using Python 2? All the code was tested with 3.4+, so unicode-vs-str is the main source of hiccups)
        Cheers,
        Marco


  14. Hey, if you split using punctuation, the colons, full stops and slashes in URLs will be removed too. That shouldn’t happen.


    1. Hi Omkar
      that’s correct, in fact we don’t split using punctuation, we have a custom regex for tokenisation. See Part 2 of this series for the details.
      Cheers
      Marco

