This is the third part in a series of articles about data mining on Twitter. After collecting data and pre-processing some text, we are ready for some basic analysis. In this article, we’ll discuss the analysis of term frequencies to extract meaningful terms from our tweets.
Tutorial Table of Contents:
- Part 1: Collecting data
- Part 2: Text Pre-processing
- Part 3: Term Frequencies (this article)
- Part 4: Rugby and Term Co-Occurrences
- Part 5: Data Visualisation Basics
- Part 6: Sentiment Analysis Basics
- Part 7: Geolocation and Interactive Maps
Counting Terms
Assuming we have collected a list of tweets (see Part 1 of the tutorial), the first exploratory analysis we can perform is a simple word count. In this way, we can observe which terms are most commonly used in the data set. In this example, I’ll use the set of my tweets, so the most frequent words should correspond to the topics I discuss (not necessarily, but bear with me for a couple of paragraphs).
We can use a custom tokeniser to split the tweets into a list of terms. The following code uses the preprocess() function described in Part 2 of the tutorial, in order to capture Twitter-specific aspects of the text, such as #hashtags, @-mentions, emoticons and URLs. In order to keep track of the frequencies while we are processing the tweets, we can use collections.Counter() which internally is a dictionary (term: count) with some useful methods like most_common():
import operator
import json
from collections import Counter

fname = 'mytweets.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))
The above code will produce some unimpressive results:
[(':', 44), ('rt', 26), ('to', 26), ('and', 25), ('on', 22)]
As you can see, the most frequent words (or should I say, tokens), are not exactly meaningful.
Removing stop-words
In every language, some words are particularly common. While their use in the language is crucial, they don’t usually convey a particular meaning, especially if taken out of context. This is the case for articles, conjunctions, some adverbs, etc., which are commonly called stop-words. In the example above, we can see three common stop-words: to, and and on. Stop-word removal is one important step that should be considered during the pre-processing stages. One can build a custom list of stop-words, or use available lists (e.g. NLTK provides a simple list of English stop-words).
Given the nature of our data and our tokenisation, we should also be careful with all the punctuation marks and with terms like RT (used for re-tweets) and via (used to mention the original author of an article or a re-tweet), which are not in the default stop-word list.
from nltk.corpus import stopwords
import string

punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']
We can now substitute the variable terms_all in the first example with something like:
terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]
After counting, sorting the terms and printing the top 5, this is the result:
[('python', 11), ('@miguelmalvarez', 9), ('#python', 9), ('data', 8), ('@danielasfregola', 7)]
So apparently I mostly tweet about Python and data, and the users I re-tweet most often are @miguelmalvarez and @danielasfregola. It sounds about right.
More term filters
Besides stop-word removal, we can further customise the list of terms/tokens we are interested in. Here are some examples that you can embed in the first fragment of code:
# Count terms only once, equivalent to Document Frequency
terms_single = set(terms_all)
# Count hashtags only
terms_hash = [term for term in preprocess(tweet['text'])
              if term.startswith('#')]
# Count terms only (no hashtags, no mentions)
terms_only = [term for term in preprocess(tweet['text'])
              if term not in stop and
              not term.startswith(('#', '@'))]
              # mind the ((double brackets))
              # startswith() takes a tuple (not a list) if
              # we pass a list of inputs
After counting and sorting, these are my most commonly used hashtags:
[('#python', 9), ('#scala', 6), ('#nosql', 4), ('#bigdata', 3), ('#nlp', 3)]
and these are my most commonly used terms:
[('python', 11), ('data', 8), ('summarisation', 6), ('twitter', 5), ('nice', 5)]
“nice”?
While the other frequent terms represent a clear topic, more often than not simple term frequencies don’t give us a deep explanation of what the text is about. To put things in context, let’s consider sequences of two terms (a.k.a. bigrams).
from nltk import bigrams

terms_bigram = bigrams(terms_stop)
The bigrams() function from NLTK will take a list of tokens and produce a list of tuples of adjacent tokens. Notice that we could use terms_all to compute the bigrams, but we would probably end up with a lot of garbage. If we decide to analyse longer n-grams (sequences of n tokens), it could make sense to keep the stop-words, in case we want to capture phrases like “to be or not to be”.
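For reference, here is a minimal sketch of the bigram counting step (the full loop isn’t spelled out in the article; it assumes the preprocess() function from Part 2 and the stop list defined above are in scope):

import json
from collections import Counter
from nltk import bigrams

fname = 'mytweets.json'
with open(fname, 'r') as f:
    count_bigrams = Counter()
    for line in f:
        tweet = json.loads(line)
        terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]
        # bigrams() yields tuples of adjacent tokens, e.g. ('nice', 'article')
        count_bigrams.update(bigrams(terms_stop))
    # Print the 5 most frequent bigrams
    print(count_bigrams.most_common(5))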
So after counting and sorting the bigrams, this is the result:
[(('nice', 'article'), 4), (('extractive', 'summarisation'), 4), (('summarisation', 'sentence'), 3), (('short', 'paper'), 3), (('paper', 'extractive'), 2)]
So apparently I tweet about nice articles (I wouldn’t bother sharing the boring ones) and extractive summarisation (the topic of my PhD dissertation). This also sounds about right.
Summary
This article has built on the previous ones to discuss some basics of extracting interesting terms from a data set of tweets, using simple term frequencies, stop-word removal and n-grams. While these approaches are extremely simple to implement, they are quite useful for getting a bird’s-eye view of the data. We have used some components of NLTK (introduced in a previous article), so we don’t have to re-invent the wheel.
Hi. Thanks for this wonderful guide :-) and hello from Denmark.
I have installed 32-bit Anaconda and I use IPython, and I am very noobish, since I get an error when I run this code:
import operator
import json
from collections import Counter

fname = '26-Oktober-data.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))

TypeError: 'NoneType' object is not iterable
I have got a JSON file named 26-Oktober-data.json with about 14000 tweets that I would like to manipulate, but I keep getting this error. I actually DID manage to get it to work the first time I tried it (Python told me which tokens were used the most), but when I tried it again an hour later, to learn it again, I got the above error.
What am I missing here? :-)
Hi Jeppe, apologies for the late reply. I’m trying to wrap up the comments on both articles. I’ve been through your latest example with vwdata.json and you’re using the preprocess() function twice. This is a sample code that wraps everything together: https://gist.github.com/bonzanini/3fdc080258fc53bcd3fa
Assuming the data are in JSON Lines (one JSON doc per line, no empty lines), it works fine. Please see if this one solves the previous issues
Cheers,
Marco
import json
import string

fname = open('26-Oktober-data.json')
counts = dict()
for line in fname:
    line = line.translate(None, string.punctuation)
    line = line.lower()
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

lst = list()
for key, val in counts.items():
    lst.append((val, key))
lst.sort(reverse=True)

for key, val in lst[:10]:
    print key, val

## this variant works like a charm
Great post, thank you. Is there a way to handle capitalisation in the filters, so that #ThankYou and #thankyou are not treated as two different strings?
Hi bfrost888, sorry for the late reply.
The "track" parameter that you use with the streaming API is not case sensitive. What you can do is simply to process your tweets after you've downloaded them (without lowercasing/normalisation, so you keep the original casing).
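For example (a sketch, not from the original reply), you could lowercase the hashtags while counting, or simply pass lowercase=True to the preprocess() function from Part 2, so that #ThankYou and #thankyou are merged:

# inside the loop over tweets
terms_hash = [term.lower() for term in preprocess(tweet['text'])
              if term.startswith('#')]
count_all.update(terms_hash)

# alternatively, let preprocess() do the lowercasing (emoticons are kept as-is)
tokens = preprocess(tweet['text'], lowercase=True)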
Hey, great post! But when I'm using this I get an error with my data:
line 44, in <module>
    terms_all = [term for term in preprocess(tweet['text'])]
KeyError: 'text'
Any thoughts on what I am doing wrong?
Best,
Hi Pydwon,
Just make sure the file you’re trying to process is in JSON Lines format (each line is a JSON document, no empty lines). Occasionally, for a variety of reasons, the response from the streaming API is not a tweet but an error message, which is still a json document, so the whole file is still in the valid JSON Lines format. In this case, you’d need to either filter out the non-tweet lines from the file, or to check whether the tweet dictionary has a text key (I’d go with a clean dataset)
Cheers,
Marco
Hey Marco,
first off, thank you for the quick reply!
I think you might be right about the error response from the API. I collected for a whole day, hence the 4 GB file size, which makes an error message very likely. What do you think would be the best and fastest method to clean my JSON up? Opening it in Excel or in a text editor?
Best,
I'd briefly go with cat/head/grep, e.g.
head -100000 yourdata.jsonl | grep -v "text"
just to see what the error messages look like. With a few lines of Python you can then iterate over the file and discard the lines without "text" (assuming only genuine tweets have the key "text").
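A minimal sketch of that clean-up step (the file names here are just placeholders):

import json

with open('yourdata.jsonl', 'r') as infile, open('yourdata_clean.jsonl', 'w') as outfile:
    for line in infile:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        try:
            doc = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if 'text' in doc:  # keep only genuine tweets
            outfile.write(line + '\n')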
Cheers,
Marco
Hi, Marco.
I had some problems with do_something_else(); after reading the comments, I saw that I only needed to define the function. So I decided to replace it with this code, as you suggested:
fname = 'python.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        count_all.update(tokens)
    print(count_all.most_common(5))
However, I'm having trouble. This is the error:
Traceback (most recent call last):
  File "C:\Users\USER\workspace\TesteApi\Teste.py", line 128, in <module>
    do_something_else(tokens)
  File "C:\Users\USER\workspace\TesteApi\Teste.py", line 119, in do_something_else
    tweet = json.loads(line)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python35\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python35\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\USER\AppData\Local\Programs\Python\Python35\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
Please, can you help me ?
*Sorry about my English*
The same thing happens to me and I don't know how to fix it. Did you get it to work?
I also have this exact same error.
I had the exact same error. I had to put a try/except block within the for loop to get it to work. See my code below.
fname = 'python.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        try:
            tweet = json.loads(line)
            # Create a list with all the terms
            terms_all = [term for term in preprocess(tweet['text'])]
            # Update the counter
            count_all.update(terms_all)
        except:
            continue
    # Print the first 5 most frequent words
    print(count_all.most_common(5))
Hi
Thanks for the great example. I am getting the following error. Can you please help me with this?
Traceback (most recent call last):
  File "F:\Studies\MS in CS\Project\init_anal_main.py", line 11, in <module>
    terms_all = [term for term in preprocess(tweet['text'])]
NameError: name 'preprocess' is not defined
Hi Bala,
the error occurs because the function preprocess() is not defined in your code. You need to include the function definitions from part 2 of the tutorial (i.e. the block with preprocess(), tokenize() and related regular expressions)
Cheers,
Marco
Hi!
Great example, but I have a slight problem.
The tokenizer works great with english language, but it starts breaking with unicode characters.
For example:
Barça, que más veces ha jugado contra 10 en la historia https://t.co/7WUjZrMJah #UCL
tokenizes to:
Bar
ç
a
,
que
m
á
s
veces
ha
jugado
contra
10
en
la
historia
https://t.co/7WUjZrMJah
#UCL
I need 'Barça' instead of 'Bar', 'ç', 'a' …
The default tokenizer takes care of this but breaks on hashtags, mentions and links, as you mentioned.
Any ways I can get both to work?
Help appreciated! :)
Hi Krishanu,
the TweetTokenizer in NLTK goes one step closer to what you need, but it still breaks on occasions, e.g.
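(The example didn't survive the comment formatting; as a rough illustration of TweetTokenizer on the tweet above:)

from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
print(tknzr.tokenize("Barça, que más veces ha jugado contra 10 en la historia https://t.co/7WUjZrMJah #UCL"))
# 'Barça' stays in one piece, and the URL and #UCL are kept whole,
# although some other edge cases may still differ from the custom regex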
I’d take a look at their source to see if it can be extended/simplified easily.
Cheers,
Marco
Thanks!
Looking forward to your reply!
Hi, I tried running your code but I get the error:
Traceback (most recent call last):
  File "D:\WinPython-32bit-2.7.10.3\python-2.7.10\twitter_most_common_words.py", line 58, in <module>
    tokens = preprocess(tweet['text'])
KeyError: 'text'
I have no idea what went wrong though.
import sys
import json
from collections import Counter
import re
from nltk.corpus import stopwords
import string

punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

with open('Data 20k.json', 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        j = json.loads(line)
        tokens = preprocess(tweet['text'])
        count_all.update(tokens)
    print(count_all.most_common(5))
I used everything except the last part where I made some minor changes:
with open('Data 20k.json', 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        j = json.loads(line)
        tokens = preprocess(tweet['text'])
        count_all.update(tokens)
    print(count_all.most_common(5))
Please help me. Thank you.
Hi Jeremy, probably one of the lines in your data file is not a correct tweet. Make sure you have exactly one tweet per line, no empty lines, and if there are errors (e.g. network bumps from Twitter) you can remove them. In the meanwhile the workaround could be:
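(The snippet is missing from the comment; a minimal sketch of the workaround, skipping lines that fail to parse, could look like this:)

for line in f:
    try:
        tweet = json.loads(line)
    except ValueError:
        continue  # skip empty or malformed lines
    if 'text' not in tweet:
        continue  # skip non-tweet documents (e.g. error messages)
    count_all.update(preprocess(tweet['text']))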
Cheers,
Marco
Hi Marco ,
Thanks for the quick reply. I have checked my JSON file and noticed there was a blank line in between, which I have removed. However, when I tried running the program again I got this error instead:
ValueError: Extra data: line 1 column 4488 - line 1 column 99678411 (char 4487 - 99678410)
Any idea?
Thanks
Just to add the error message:
tweet = json.loads(line)
return _default_decoder.decode(s)
raise ValueError(errmsg("Extra data", s, end, len(s)))
Hi, thanks so much for this, it's most helpful. I do have a problem, though. I'm working with a large data file, 80,000 tweets or so, which is in an SQL database. I have a CSV of the text of the tweets only, which I have managed to convert to JSON and parse using the code on this page and the previous two. I am not a coder, but I can copy-paste and understand basic syntax.
I am trying to extract hashtags, and your script only finds two hashtags in the whole corpus, both of them present, but this is by no means all of them. I am reasonably familiar with the tags in the archive already, since that is part of the rationale for harvesting them. In addition, the count for each tag (and for each term, when I just do terms) is 78289, which is the number of records in the archive.
I’m at a loss, do you have any suggestions as to how I would go about debugging this?
Okay, I've made some progress. It seems to be seeing only the final tweet in the list, and counting all of those terms 78289 times. I can view the JSON file, and the tweets all seem to be there (I don't have a file with 78289 copies of the same tweet, at least), and when I view the terms (print(tokens) from the tokenizing code) I can see more terms than the nine in the last tweet. Something about the counter isn't working?
Great, now I'm embarrassed. Problem solved. I was trying to be clever and reuse "tokens" from elsewhere in the code instead of pre-processing again. Sorry to have bothered you. Thanks again.
No worries, glad the problem is solved :)
Cheers,
Marco
So in my collection of tweets the token '\u2026' comes up a lot, and it is apparently unicode for the ellipsis '…'. When I try to get rid of it by adding it to the stop list (stop = stopwords.words('english') + punctuation + ['rt', 'RT', 'via', '\u2026']) it still appears. Any ideas how to fix this?
I just read the next section where you talk specifically about the unicode characters. It’s easy enough for me just to ignore it.
How would one go about counting the frequency of a specific set of phrases (e.g. titles of movies) among all the mined tweets?
Hi Neil, you could look into n-grams (phrases of length n). At the end of the article I’ve introduced bigrams (n-grams of length 2) and the concept is very similar. NLTK also offers the nltk.ngrams() function, e.g.
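(The example was lost in the comment formatting; a minimal sketch, with a made-up movie title:)

from nltk import ngrams

title = 'the dark knight'              # hypothetical title to search for
title_tokens = tuple(title.split())    # n = 3 in this case
tokens = preprocess(tweet['text'], lowercase=True)
matches = [gram for gram in ngrams(tokens, len(title_tokens)) if gram == title_tokens]
print(len(matches))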
Given the title you're looking for, you split it on whitespace so you know the "n" you need.
Cheers,
Marco
Thanks! I was playing around with this same problem yesterday and ended up with a crude implementation using Python's string.count(title). I'll try ngrams today.
Thanks again for this guide!
Sorry to bug you again. Your code is really helpful and finally helping me learn the concepts! Yesterday I ran this code and it worked:
import operator
import json
from collections import Counter

fname = 'C:\Users\Public\Documents\Python Scripts\Clinton.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the hashtags
        terms_hash = [term for term in preprocess(tweet['text'])
                      if term.startswith('#')]
        # Update the counter
        count_all.update(terms_hash)
    # Print the first 20 most frequent hashtags
    print(count_all.most_common(20))
However, now I get an error on the terms_hash line: list indices must be integers, not str. No idea what happened or how to fix it. :-(
Hi
check the format of your JSON file: each line must be a valid JSON document, with no empty lines. I suspect you get the error when the "tweet" variable is not a dictionary as expected. You could either clean the file manually or wrap your code in a try/except block to capture the error.
Cheers,
Marco
Hi!
I keep getting Unicode characters in my frequency analysis results, e.g.
[(u'science', 15), (u'\u2026', 7), (u'2016', 2), (u'#biology', 2), (u'computer', 2), (u'water', 2), (u'12', 2), (u'1', 2), (u'gonna', 2), (u'code', 1)]
How do I stop getting the \u2026??
cheers
Hi Benji, that particular unicode symbol is the ellipsis character, so I suggest you simply add it to the list of stop-words. In the next article (part 4) there’s a brief mention about it, also in the comments. From time to time you’ll have to refine your stop-word list anyway depending on your data.
Cheers,
Marco
Hi Marco, I've tried to add it to the stop list but nothing works. I've tried adding it as '\u2026' and as '…' but it still shows up in the output.
Is there something I'm missing?
currently it is:
stop = stopwords.words('english') + punctuation + ['rt', 'via', '\u2026']
cheers
Hi Benji, sorry for the late reply. Try with u'\u2026' so you have the correct unicode string rather than a str (I'm assuming you're using Python 2? All the code was tested with 3.4+, so unicode-vs-str is the main source of hiccups).
Cheers,
Marco
Hey, if you split using punctuation, the colons, full stops and slashes in URLs will be removed too. That shouldn't happen.
Hi Omkar
that’s correct, in fact we don’t split using punctuation, we have a custom regex for tokenisation. See Part 2 of this series for the details.
Cheers
Marco
Dear Marco,
First of all I’d like to thank you for your awesome work, I have the book and I can say it is really great.
However, on this section I’m facing an error, which is on this piece of code:
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))
The error that I got is on terms_all = [term for term in preprocess(tweet['text'])]:
TypeError: list indices must be integers, not str
How can I correct this? By this time I don't have empty lines in the JSON file, because I managed to delete them.
Thanks in advance and keep up the terrific work!
Hi, you'd expect the variable "tweet" to be a dictionary (with the key "text" among others), so in order to debug this problem you should first confirm that this is the case, and possibly identify which input is causing the problem.
HTH
Cheers,
Marco
Hi Marco,
Using the GitHub code mentioned above:
I am seeing an output like [(u'\ud83c', 18), (u'RT', 16), (u'Mars', 13), (u'\u2026', 10), (u'\ud83d', 10)]
The letter u seems to prefix everything, and the word RT is still showing up in the output.
Am I missing something basic in this or is it an error?
import sys
import json
from collections import Counter
import re
from nltk.corpus import stopwords
import string

punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

if __name__ == '__main__':
    fname = 'Filename.json'
    with open(fname, 'r') as f:
        count_all = Counter()
        for line in f:
            tweet = json.loads(line)
            # Create a list with all the terms
            terms_all = [term for term in preprocess(tweet['text']) if term not in stop]
            # Update the counter
            count_all.update(terms_all)
        # Print the first 5 most frequent words
        print(count_all.most_common(5))
Thank you, you're a Pythonista… what is your PhD about?