This is the third part in a series of articles about data mining on Twitter. After collecting data and pre-processing some text, we are ready for some basic analysis. In this article, we’ll discuss the analysis of term frequencies to extract meaningful terms from our tweets.
Tutorial Table of Contents:
- Part 1: Collecting data
- Part 2: Text Pre-processing
- Part 3: Term Frequencies (this article)
- Part 4: Rugby and Term Co-Occurrences
- Part 5: Data Visualisation Basics
- Part 6: Sentiment Analysis Basics
- Part 7: Geolocation and Interactive Maps
Counting Terms
Assuming we have collected a list of tweets (see Part 1 of the tutorial), the first exploratory analysis we can perform is a simple word count. In this way, we can observe which terms are most commonly used in the data set. In this example, I’ll use the set of my tweets, so the most frequent words should correspond to the topics I discuss (not necessarily, but bear with me for a couple of paragraphs).
We can use a custom tokeniser to split the tweets into a list of terms. The following code uses the preprocess() function described in Part 2 of the tutorial, in order to capture Twitter-specific aspects of the text, such as #hashtags, @-mentions, emoticons and URLs. In order to keep track of the frequencies while we are processing the tweets, we can use collections.Counter() which internally is a dictionary (term: count) with some useful methods like most_common():
import operator
import json
from collections import Counter

fname = 'mytweets.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))
The above code will produce some unimpressive results:
[(':', 44), ('rt', 26), ('to', 26), ('and', 25), ('on', 22)]
As you can see, the most frequent words (or should I say, tokens), are not exactly meaningful.
Removing stop-words
In every language, some words are particularly common. While their use in the language is crucial, they don’t usually convey a particular meaning, especially if taken out of context. This is the case for articles, conjunctions, some adverbs, etc., which are commonly called stop-words. In the example above, we can see three common stop-words: to, and and on. Stop-word removal is one important step that should be considered during the pre-processing stages. One can build a custom list of stop-words, or use available lists (e.g. NLTK provides a simple list of English stop-words).
Given the nature of our data and our tokenisation, we should also be careful with all the punctuation marks and with terms like RT (used for re-tweets) and via (used to mention the original author of an article or a re-tweet), which are not in the default stop-word list.
from nltk.corpus import stopwords
import string

punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']
We can now substitute the variable terms_all in the first example with something like:
terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]
After counting, sorting the terms and printing the top 5, this is the result:
[('python', 11), ('@miguelmalvarez', 9), ('#python', 9), ('data', 8), ('@danielasfregola', 7)]
So apparently I mostly tweet about Python and data, and the users I re-tweet most often are @miguelmalvarez and @danielasfregola. It sounds about right.
More term filters
Besides stop-word removal, we can further customise the list of terms/tokens we are interested in. Here are some examples that you can embed in the first fragment of code:
# Count terms only once, equivalent to Document Frequency
terms_single = set(terms_all)
# Count hashtags only
terms_hash = [term for term in preprocess(tweet['text'])
              if term.startswith('#')]
# Count terms only (no hashtags, no mentions)
terms_only = [term for term in preprocess(tweet['text'])
              if term not in stop and
              not term.startswith(('#', '@'))]
              # mind the ((double brackets))
              # startswith() takes a tuple (not a list) if
              # we pass a list of inputs
After counting and sorting, these are my most commonly used hashtags:
[('#python', 9), ('#scala', 6), ('#nosql', 4), ('#bigdata', 3), ('#nlp', 3)]
and these are my most commonly used terms:
[('python', 11), ('data', 8), ('summarisation', 6), ('twitter', 5), ('nice', 5)]
“nice”?
While the other frequent terms represent a clear topic, more often than not simple term frequencies don’t give us a deep explanation of what the text is about. To put things in context, let’s consider sequences of two terms (a.k.a. bigrams).
from nltk import bigrams

terms_bigram = bigrams(terms_stop)
The bigrams() function from NLTK will take a list of tokens and produce a list of tuples of adjacent tokens. Notice that we could use terms_all to compute the bigrams, but we would probably end up with a lot of garbage. If we decide to analyse longer n-grams (sequences of n tokens), it could make sense to keep the stop-words, in case we want to capture phrases like “to be or not to be”.
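For reference, here is a minimal sketch of the bigram counting step (the full loop isn’t spelled out in the article; it assumes the preprocess() function from Part 2 and the stop list defined above are in scope):

import json
from collections import Counter
from nltk import bigrams

fname = 'mytweets.json'
with open(fname, 'r') as f:
    count_bigrams = Counter()
    for line in f:
        tweet = json.loads(line)
        terms_stop = [term for term in preprocess(tweet['text']) if term not in stop]
        # bigrams() yields tuples of adjacent tokens, e.g. ('nice', 'article')
        count_bigrams.update(bigrams(terms_stop))
    # Print the 5 most frequent bigrams
    print(count_bigrams.most_common(5))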
So after counting and sorting the bigrams, this is the result:
[(('nice', 'article'), 4), (('extractive', 'summarisation'), 4), (('summarisation', 'sentence'), 3), (('short', 'paper'), 3), (('paper', 'extractive'), 2)]
So apparently I tweet about nice articles (I wouldn’t bother sharing the boring ones) and extractive summarisation (the topic of my PhD dissertation). This also sounds about right.
Summary
This article has built on the previous ones to discuss some basics of extracting interesting terms from a data set of tweets, using simple term frequencies, stop-word removal and n-grams. While these approaches are extremely simple to implement, they are quite useful for getting a bird’s-eye view of the data. We have used some components of NLTK (introduced in a previous article), so we don’t have to re-invent the wheel.
Hi. Thanks for this wonderful guide :-) and hello from Denmark.
I have installed 32-bit Anaconda and I use IPython, and I am very noobish, since I get an error when I run this code:
import operator
import json
from collections import Counter

fname = '26-Oktober-data.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))

TypeError: 'NoneType' object is not iterable
I have got a JSON file named 26-Oktober-data.json with about 14000 tweets that I would like to manipulate, but I keep getting this error. I actually DID manage to get it to work the first time I tried it (Python told me which tokens were used the most), but when I tried it again an hour later, to learn it again, I got the above error.
What am I missing here? :-)
Hi Jeppe, apologies for the late reply. I’m trying to wrap up the comments on both articles. I’ve been through your latest example with vwdata.json and you’re using the preprocess() function twice. This is a sample code that wraps everything together: https://gist.github.com/bonzanini/3fdc080258fc53bcd3fa
Assuming the data are in JSON Lines (one JSON doc per line, no empty lines), it works fine. Please see if this one solves the previous issues
Cheers,
Marco
import json
import string

fname = open('26-Oktober-data.json')
counts = dict()
for line in fname:
    line = line.translate(None, string.punctuation)
    line = line.lower()
    words = line.split()
    for word in words:
        if word not in counts:
            counts[word] = 1
        else:
            counts[word] += 1

lst = list()
for key, val in counts.items():
    lst.append((val, key))
lst.sort(reverse=True)

for key, val in lst[:10]:
    print key, val

## this variant works like a charm
Great post, thank you. Is there a way to handle capitalisation in the filters, so that #ThankYou and #thankyou are not treated as two different strings?
Hi bfrost888, sorry for the late reply.
The "track" parameter that you use with the streaming API is not case sensitive. What you can do is simply to process your tweets after you've downloaded them (without lowercasing/normalisation, so you keep the original casing).
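For example (a sketch, not from the original reply), you could lowercase the hashtags while counting, or simply pass lowercase=True to the preprocess() function from Part 2, so that #ThankYou and #thankyou are merged:

# inside the loop over tweets
terms_hash = [term.lower() for term in preprocess(tweet['text'])
              if term.startswith('#')]
count_all.update(terms_hash)

# alternatively, let preprocess() do the lowercasing (emoticons are kept as-is)
tokens = preprocess(tweet['text'], lowercase=True)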
Hey, great post! But when I'm using this I get an error with my data:
line 44, in <module>
    terms_all = [term for term in preprocess(tweet['text'])]
KeyError: 'text'
Any thoughts on what I am doing wrong?
Best,
Hi Pydwon,
Just make sure the file you’re trying to process is in JSON Lines format (each line is a JSON document, no empty lines). Occasionally, for a variety of reasons, the response from the streaming API is not a tweet but an error message, which is still a json document, so the whole file is still in the valid JSON Lines format. In this case, you’d need to either filter out the non-tweet lines from the file, or to check whether the tweet dictionary has a text key (I’d go with a clean dataset)
Cheers,
Marco
Hey Marco,
first off, thank you for the quick reply!
I think you might be right about the error response from the API. I collected for a whole day, hence the 4 GB file size, which makes an error message very likely. What do you think would be the best and fastest method to clean my JSON up? Opening it in Excel or in a text editor?
Best,
I'd briefly go with cat/head/grep, e.g.
head -100000 yourdata.jsonl | grep -v "text"
just to see what the error messages look like. With a few lines of Python you can then iterate over the file and discard the lines without "text" (assuming only genuine tweets have the key "text").
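A minimal sketch of that clean-up step (the file names here are just placeholders):

import json

with open('yourdata.jsonl', 'r') as infile, open('yourdata_clean.jsonl', 'w') as outfile:
    for line in infile:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        try:
            doc = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if 'text' in doc:  # keep only genuine tweets
            outfile.write(line + '\n')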
Cheers,
Marco
Hi, Marco.
I had some problems with do_something_else(); after reading the comments, I saw that I only needed to define the function. So I decided to replace it with this code, as you suggested:
fname = 'python.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        count_all.update(tokens)
    print(count_all.most_common(5))
However, I'm having trouble. This is the error:
Traceback (most recent call last):
  File "C:\Users\USER\workspace\TesteApi\Teste.py", line 128, in <module>
    do_something_else(tokens)
  File "C:\Users\USER\workspace\TesteApi\Teste.py", line 119, in do_something_else
    tweet = json.loads(line)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python35\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python35\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\USER\AppData\Local\Programs\Python\Python35\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
Please, can you help me ?
*Sorry about my English*
The same thing happens to me and I don't know how to fix it. Did you get it to work?
I also have this exact same error.
I had the exact same error. I had to put a try/except block within the for loop to get it to work. See my code below.
fname = 'python.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        try:
            tweet = json.loads(line)
            # Create a list with all the terms
            terms_all = [term for term in preprocess(tweet['text'])]
            # Update the counter
            count_all.update(terms_all)
        except:
            continue
    # Print the first 5 most frequent words
    print(count_all.most_common(5))
Hi
Thanks for the great example. I am getting the following error. Can you please help me with this?
Traceback (most recent call last):
  File "F:\Studies\MS in CS\Project\init_anal_main.py", line 11, in <module>
    terms_all = [term for term in preprocess(tweet['text'])]
NameError: name 'preprocess' is not defined
Hi Bala,
the error occurs because the function preprocess() is not defined in your code. You need to include the function definitions from part 2 of the tutorial (i.e. the block with preprocess(), tokenize() and related regular expressions)
Cheers,
Marco
Hi!
Great example, but I have a slight problem.
The tokenizer works great with english language, but it starts breaking with unicode characters.
For example:
Barça, que más veces ha jugado contra 10 en la historia https://t.co/7WUjZrMJah #UCL
tokenizes to:
Bar
ç
a
,
que
m
á
s
veces
ha
jugado
contra
10
en
la
historia
https://t.co/7WUjZrMJah
#UCL
I need 'Barça' instead of 'Bar', 'ç', 'a' …
The default tokenizer takes care of this but breaks on hashtags, mentions and links, as you mentioned.
Any ways I can get both to work?
Help appreciated! :)
Hi Krishanu,
the TweetTokenizer in NLTK goes one step closer to what you need, but it still breaks on occasions, e.g.
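(The example didn't survive the comment formatting; as a rough illustration of TweetTokenizer on the tweet above:)

from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
print(tknzr.tokenize("Barça, que más veces ha jugado contra 10 en la historia https://t.co/7WUjZrMJah #UCL"))
# 'Barça' stays in one piece, and the URL and #UCL are kept whole,
# although some other edge cases may still differ from the custom regex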
I’d take a look at their source to see if it can be extended/simplified easily.
Cheers,
Marco
Thanks!
Looking forward to your reply!
Hi, I tried running your code but I get the error:
Traceback (most recent call last):
  File "D:\WinPython-32bit-2.7.10.3\python-2.7.10\twitter_most_common_words.py", line 58, in <module>
    tokens = preprocess(tweet['text'])
KeyError: 'text'
I have no idea what went wrong though.
import sys
import json
from collections import Counter
import re
from nltk.corpus import stopwords
import string

punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

with open('Data 20k.json', 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        j = json.loads(line)
        tokens = preprocess(tweet['text'])
        count_all.update(tokens)
    print(count_all.most_common(5))
I used everything except the last part where I made some minor changes:
with open('Data 20k.json', 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        j = json.loads(line)
        tokens = preprocess(tweet['text'])
        count_all.update(tokens)
    print(count_all.most_common(5))
Please help me. Thank you.
Hi Jeremy, probably one of the lines in your data file is not a correct tweet. Make sure you have exactly one tweet per line, no empty lines, and if there are errors (e.g. network bumps from Twitter) you can remove them. In the meanwhile the workaround could be:
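(The snippet is missing from the comment; a minimal sketch of the workaround, skipping lines that fail to parse, could look like this:)

for line in f:
    try:
        tweet = json.loads(line)
    except ValueError:
        continue  # skip empty or malformed lines
    if 'text' not in tweet:
        continue  # skip non-tweet documents (e.g. error messages)
    count_all.update(preprocess(tweet['text']))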
Cheers,
Marco
Hi Marco ,
Thanks for the quick reply. I have checked my JSON file and noticed there was a blank line in between, which I have removed. However, when I tried running the program again I got this error instead:
ValueError: Extra data: line 1 column 4488 - line 1 column 99678411 (char 4487 - 99678410)
Any idea?
Thanks
Just to add the error message:
tweet = json.loads(line)
return _default_decoder.decode(s)
raise ValueError(errmsg("Extra data", s, end, len(s)))
Hi, thanks so much for this, it's most helpful. I do have a problem, though. I'm working with a large data file, 80,000 tweets or so, which is in an SQL database. I have a CSV of the text of the tweets only, which I have managed to convert to JSON and parse using the code on this page and the previous two. I am not a coder, but I can copy-paste and understand basic syntax.
I am trying to extract hashtags, and your script only finds two hashtags in the whole corpus, both of them present, but this is by no means all of them. I am reasonably familiar with the tags in the archive already, since that is part of the rationale for harvesting them. In addition, the count for each tag (and for each term, when I just do terms) is 78289, which is the number of records in the archive.
I’m at a loss, do you have any suggestions as to how I would go about debugging this?
Okay, I've made some progress. It seems to be seeing only the final tweet in the list, and counting all of those terms 78289 times. I can view the JSON file, and the tweets all seem to be there (I don't have a file with 78289 copies of the same tweet, at least), and when I view the terms (print(tokens) from the tokenizing code) I can see more terms than the nine in the last tweet. Something about the counter isn't working?
Great, now I'm embarrassed. Problem solved. I was trying to be clever and reuse "tokens" from elsewhere in the code instead of pre-processing again. Sorry to have bothered you. Thanks again.
No worries, glad the problem is solved :)
Cheers,
Marco
So in my collection of tweets the token '\u2026' comes up a lot, and it is apparently unicode for the ellipsis '…'. When I try to get rid of it by adding it to the stop list (stop = stopwords.words('english') + punctuation + ['rt', 'RT', 'via', '\u2026']) it still appears. Any ideas how to fix this?
I just read the next section where you talk specifically about the unicode characters. It’s easy enough for me just to ignore it.
How would one go about counting the frequency of a specific set of phrases (e.g. titles of movies) among all the mined tweets?
Hi Neil, you could look into n-grams (phrases of length n). At the end of the article I’ve introduced bigrams (n-grams of length 2) and the concept is very similar. NLTK also offers the nltk.ngrams() function, e.g.
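(The example was lost in the comment formatting; a minimal sketch, with a made-up movie title:)

from nltk import ngrams

title = 'the dark knight'              # hypothetical title to search for
title_tokens = tuple(title.split())    # n = 3 in this case
tokens = preprocess(tweet['text'], lowercase=True)
matches = [gram for gram in ngrams(tokens, len(title_tokens)) if gram == title_tokens]
print(len(matches))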
Given the title you're looking for, you split it on whitespace so you know the "n" you need.
Cheers,
Marco
Thanks! I was playing around with this same problem yesterday and ended up with a crude implementation using Python's string.count(title). I'll try ngrams today.
Thanks again for this guide!
Sorry to bug you again. Your code is really helpful and finally helping me learn the concepts! Yesterday I ran this code and it worked:
import operator
import json
from collections import Counter

fname = 'C:\Users\Public\Documents\Python Scripts\Clinton.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the hashtags
        terms_hash = [term for term in preprocess(tweet['text'])
                      if term.startswith('#')]
        # Update the counter
        count_all.update(terms_hash)
    # Print the first 20 most frequent hashtags
    print(count_all.most_common(20))
However, now I get an error on the terms_hash line: list indices must be integers, not str. No idea what happened or how to fix it. :-(
Hi
check the format of your JSON file: each line must be a valid JSON document, with no empty lines. I suspect you get the error when the "tweet" variable is not a dictionary as expected. You could either clean the file manually or wrap your code in a try/except block to capture the error.
Cheers,
Marco
Hi!
I keep getting Unicode characters in my frequency analysis results, e.g.
[(u'science', 15), (u'\u2026', 7), (u'2016', 2), (u'#biology', 2), (u'computer', 2), (u'water', 2), (u'12', 2), (u'1', 2), (u'gonna', 2), (u'code', 1)]
How do I stop getting the \u2026??
cheers
Hi Benji, that particular unicode symbol is the ellipsis character, so I suggest you simply add it to the list of stop-words. In the next article (part 4) there’s a brief mention about it, also in the comments. From time to time you’ll have to refine your stop-word list anyway depending on your data.
Cheers,
Marco
Hi Marco, I've tried to add it to the stop list but nothing works. I've tried adding it as '\u2026' and as '…' but it still shows up in the output.
Is there something I'm missing?
currently it is:
stop = stopwords.words('english') + punctuation + ['rt', 'via', '\u2026']
cheers
Hi Benji, sorry for the late reply. Try with u'\u2026' so you have the correct unicode string rather than a str (I'm assuming you're using Python 2? All the code was tested with 3.4+, so unicode-vs-str is the main source of hiccups).
Cheers,
Marco
Hey, if you split using punctuation, the colons, full stops and slashes in URLs will be removed too. That shouldn't happen.
Hi Omkar
that’s correct, in fact we don’t split using punctuation, we have a custom regex for tokenisation. See Part 2 of this series for the details.
Cheers
Marco
Dear Marco,
First of all I’d like to thank you for your awesome work, I have the book and I can say it is really great.
However, on this section I’m facing an error, which is on this piece of code:
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))
The error that I got is on terms_all = [term for term in preprocess(tweet['text'])]:
TypeError: list indices must be integers, not str
How can I correct this? By this time I don't have empty lines in the JSON file, because I managed to delete them.
Thanks in advance and keep up the terrific work!
Hi, you'd expect the variable "tweet" to be a dictionary (with the key "text" among others), so in order to debug this problem you should first confirm that this is the case, and possibly identify which input is causing the problem.
HTH
Cheers,
Marco
Hi Marco,
Using the GitHub code mentioned above:
I am seeing an output like [(u'\ud83c', 18), (u'RT', 16), (u'Mars', 13), (u'\u2026', 10), (u'\ud83d', 10)]
The letter u seems to prefix everything, and the word RT is still showing up in the output.
Am I missing something basic in this or is it an error?
import sys
import json
from collections import Counter
import re
from nltk.corpus import stopwords
import string

punctuation = list(string.punctuation)
stop = stopwords.words('english') + punctuation + ['rt', 'via']

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

if __name__ == '__main__':
    fname = 'Filename.json'
    with open(fname, 'r') as f:
        count_all = Counter()
        for line in f:
            tweet = json.loads(line)
            # Create a list with all the terms
            terms_all = [term for term in preprocess(tweet['text']) if term not in stop]
            # Update the counter
            count_all.update(terms_all)
        # Print the first 5 most frequent words
        print(count_all.most_common(5))
Thank you, you're a Pythonista… what is your PhD about?