Mining Twitter Data with Python (Part 2: Text Pre-processing)

This is the second part of a series of articles about data mining on Twitter. In the previous episode, we saw how to collect data from Twitter. In this post, we’ll discuss the structure of a tweet and start digging into the processing steps needed for text analysis.

Table of Contents of this tutorial:

Part 1: Collecting data
Part 2: Text Pre-processing (this article)
Part 3: Term Frequencies
Part 4: Rugby and Term Co-Occurrences
Part 5: Data Visualisation Basics
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps

The Anatomy of a Tweet

Assuming that you have collected a number of tweets and stored them in JSON as suggested in the previous article, let’s have a look at the structure of a tweet:

import json

with open('mytweets.json', 'r') as f:
    line = f.readline() # read only the first tweet/line
    tweet = json.loads(line) # load it as Python dict
    print(json.dumps(tweet, indent=4)) # pretty-print

The key attributes are the following:

text: the text of the tweet itself
created_at: the date of creation
favorite_count, retweet_count: the number of favourites and retweets
favorited, retweeted: booleans stating whether the authenticated user (you) has favourited or retweeted this tweet
lang: language code (e.g. “en” for English)
id: the tweet identifier
place, coordinates, geo: geo-location information if available
user: the author’s full profile
entities: list of entities like URLs, @-mentions, hashtags and symbols
in_reply_to_user_id: user identifier if the tweet is a reply to a specific user
in_reply_to_status_id: status identifier if the tweet is a reply to a specific status

As you can see, there’s a lot of information we can play with. All the *_id fields also have a *_id_str counterpart, where the same information is stored as a string rather than a big int (to avoid overflow problems). These data already allow for some interesting analysis: we can check who is most favourited/retweeted, who’s talking to whom, which hashtags are the most popular and so on. Most of what we’re looking for, i.e. the content of a tweet, is embedded in the text, and that’s where we start our analysis.
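As a small illustration of the attributes listed above, the sketch below pulls a few fields out of a tweet that has already been loaded as a Python dict. The sample dict is hand-written for this example (it is not real Twitter output and only contains the keys it uses), and the helper name summarise_tweet is hypothetical.

```python
# A hand-written sample that mimics the shape of a tweet (assumption:
# real tweets carry many more keys than the ones shown here).
sample_tweet = {
    'id': 123456789012345678,
    'id_str': '123456789012345678',
    'text': 'RT @marcobonzanini: just an example! :D http://example.com #NLP',
    'lang': 'en',
    'retweet_count': 3,
    'user': {'screen_name': 'some_user'},
    'entities': {
        'hashtags': [{'text': 'NLP'}],
        'user_mentions': [{'screen_name': 'marcobonzanini'}],
    },
}

def summarise_tweet(tweet):
    """Return a few commonly used fields of a tweet (hypothetical helper)."""
    entities = tweet.get('entities', {})
    return {
        'id': tweet['id_str'],  # string version of the identifier
        'author': tweet['user']['screen_name'],
        'hashtags': [h['text'] for h in entities.get('hashtags', [])],
        'mentions': [m['screen_name'] for m in entities.get('user_mentions', [])],
        'retweets': tweet.get('retweet_count', 0),
    }

summary = summarise_tweet(sample_tweet)
print(summary)
```

Using .get() with a default for the optional parts keeps the helper from failing on records that lack them.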

We start our analysis by breaking the text down into words. Tokenisation is one of the most basic, yet most important, steps in text analysis. The purpose of tokenisation is to split a stream of text into smaller units called tokens, usually words or phrases. While this is a well understood problem with several out-of-the-box solutions from popular libraries, Twitter data pose some challenges because of the nature of the language.

How to Tokenise a Tweet Text

Let’s see an example, using the popular NLTK library to tokenise a fictitious tweet:

from nltk.tokenize import word_tokenize

tweet = 'RT @marcobonzanini: just an example! :D http://example.com #NLP'
print(word_tokenize(tweet))
# ['RT', '@', 'marcobonzanini', ':', 'just', 'an', 'example', '!', ':', 'D', 'http', ':', '//example.com', '#', 'NLP']

You will notice some peculiarities that are not captured by a general-purpose English tokeniser like the one from NLTK: @-mentions, emoticons, URLs and #hash-tags are not recognised as single tokens. The following code proposes a pre-processing chain that takes these aspects of the language into account.

import re

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
   
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

tweet = 'RT @marcobonzanini: just an example! :D http://example.com #NLP'
print(preprocess(tweet))
# ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', 'http://example.com', '#NLP']

As you can see, @-mentions, emoticons, URLs and #hash-tags are now preserved as individual tokens.

If we want to process all our tweets, previously saved on file:

with open('mytweets.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        do_something_else(tokens)
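The loop above assumes that every line of the file is a valid JSON document. A slightly more defensive variant, sketched below, skips blank lines and lines that fail to parse. The function name iter_tweets is hypothetical, and silently skipping bad lines is an assumption that may not suit every application.

```python
import json

def iter_tweets(lines):
    """Yield one dict per valid JSON line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank line between tweets
        try:
            tweet = json.loads(line)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            continue
        if isinstance(tweet, dict):
            yield tweet

# Stand-in for the lines of a file; with a real file you would pass the
# open file object instead, e.g. iter_tweets(f).
sample_lines = [
    '{"text": "first tweet"}\n',
    '\n',
    'not valid json\n',
    '{"text": "second tweet"}\n',
]
texts = [tweet['text'] for tweet in iter_tweets(sample_lines)]
print(texts)  # ['first tweet', 'second tweet']
```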

The tokeniser is far from perfect, but it gives you the general idea. The tokenisation is based on regular expressions (regexp), which is a common choice for this type of problem. Some particular types of tokens (e.g. phone numbers or chemical names) will not be captured, and will probably be broken into several tokens. To overcome this problem, and to enrich your pre-processing pipeline, you can improve the regular expressions, or even employ more sophisticated techniques like Named Entity Recognition.
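To give an idea of what improving the regular expressions can look like, here is a minimal sketch that adds a pattern for phone numbers to a reduced list of patterns. The phone pattern is a hypothetical one that only covers a couple of US-style formats, and the list is a cut-down version of regex_str kept short for readability.

```python
import re

# Hypothetical pattern for phone numbers such as 555-123-4567 or
# (555) 123-4567; real-world phone formats vary far more than this.
phone_str = r'(?:\(?\d{3}\)?[\s\-]\d{3}\-\d{4})'

patterns = [
    phone_str,                      # must come before the generic numbers
    r'(?:(?:\d+,?)+(?:\.?\d+)?)',   # numbers (same pattern as in regex_str)
    r'(?:[\w_]+)',                  # other words
    r'(?:\S)',                      # anything else
]

with_phone_re = re.compile('|'.join(patterns))
without_phone_re = re.compile('|'.join(patterns[1:]))

text = 'call 555-123-4567 now'
without_phone = without_phone_re.findall(text)
with_phone = with_phone_re.findall(text)

print(without_phone)  # ['call', '555', '-', '123', '-', '4567', 'now']
print(with_phone)     # ['call', '555-123-4567', 'now']
```

The order of the patterns matters: alternatives are tried from left to right, so the more specific pattern has to appear before the more general ones.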

The core component of the tokeniser is the regex_str variable, which is a list of possible patterns. In particular, we try to capture some emoticons, HTML tags, Twitter @usernames (@-mentions), Twitter #hashtags, URLs, numbers, words with and without dashes and apostrophes, and finally “anything else”. Please take a moment to observe the regexp for capturing numbers: why don’t we just use \d+? The problem here is that numbers can appear in several different ways, e.g. 1000 can also be written as 1,000 or 1,000.00 — and we can get into more complications in a multi-lingual environment where commas and dots are inverted: “one thousand” can be written as 1.000 or 1.000,00 in many non-anglophone countries. The task of identifying numeric tokens correctly just gives you a glimpse of how difficult tokenisation can be.
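The difference between \d+ and the pattern used for numbers can be checked with a quick experiment. The sketch below applies both to a few spellings of “one thousand”; the variable names are only illustrative.

```python
import re

naive_re = re.compile(r'\d+')
number_re = re.compile(r'(?:(?:\d+,?)+(?:\.?\d+)?)')  # same pattern as in regex_str

examples = ['1000', '1,000', '1,000.00', '1.000,00']
results = {text: (naive_re.findall(text), number_re.findall(text))
           for text in examples}

for text, (naive, number) in results.items():
    print(text, naive, number)
# 1000 ['1000'] ['1000']
# 1,000 ['1', '000'] ['1,000']
# 1,000.00 ['1', '000', '00'] ['1,000.00']
# 1.000,00 ['1', '000', '00'] ['1.000', '00']
```

The last line shows that the pattern keeps the anglophone notation together, while the notation with inverted commas and dots is still split into two tokens.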

The regular expressions are compiled with the flags re.VERBOSE, so that spaces and comments in the regexp are ignored (see the multi-line emoticons regexp), and re.IGNORECASE, to match both uppercase and lowercase. The tokenize() function simply finds all the tokens in a string and returns them as a list. This function is used within preprocess(), which acts as a pre-processing chain: in this case we simply add an optional lowercasing step (off by default) for all the tokens that are not emoticons (e.g. :D doesn’t become :d).
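The lowercasing step can be tried in isolation. The sketch below repeats only the emoticon pattern and applies the same list comprehension used in preprocess() to a hand-written list of tokens, which stands in for the output of tokenize().

```python
import re

# Emoticon pattern written over several lines; re.VERBOSE makes the
# regex engine ignore the whitespace and the trailing comments.
emoticons_str = r"""
    (?:
        [:=;]               # Eyes
        [oO\-]?             # Nose (optional)
        [D\)\]\(/\\OpP]     # Mouth
    )"""
emoticon_re = re.compile(r'^' + emoticons_str + '$', re.VERBOSE | re.IGNORECASE)

# Hand-written tokens, standing in for the output of tokenize()
tokens = ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', '#NLP']

# Lowercase everything except the tokens that are emoticons
lowered = [t if emoticon_re.search(t) else t.lower() for t in tokens]
print(lowered)
# ['rt', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', '#nlp']
```

With the full code from this article, the same result is obtained by calling preprocess(tweet, lowercase=True).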

Summary

In this article we have analysed the overall structure of a tweet, and we have discussed how to pre-process the text before we can get into some more interesting analysis. In particular, we have seen how tokenisation, despite being a well-understood problem, can get tricky with Twitter data. The proposed solution is far from perfect but it’s a good starting point, and fairly easy to extend.

@MarcoBonzanini

Published by

Marco

Data Scientist

81 thoughts on “Mining Twitter Data with Python (Part 2: Text Pre-processing)”

brat197 says:

April 11, 2015 at 5:27 am

Should there be “”” at line 19?

1. Marco says:
  
  April 11, 2015 at 8:25 am
  
  nope, I fixed it. Thanks, good catch :)
  
Guo ShiYu says:

July 13, 2015 at 2:33 pm

At first, thank you for this very useful article. I am a student of University of Tsukuba, Japan, and I’m doing the research about Twitter for analyzing the features of celebrity users’ tweets. I think your article could help a lot, but I cannot see all the codes, just few lines. Could you tell me how to see all the codes? Thanks again.

Raju Kadam (@rkadam) says:

October 15, 2015 at 12:06 am

Awesome series of blogs! Very helpful, thank you! NLTK already has TweetTokenizer :) I guess you just wanted to start from most basic step i.e. using regex and then build on it.

1. Marco says:
  
  October 15, 2015 at 5:21 am
  
  Thanks Raju, that’s correct, the approach is also very similar, with a long list of regex to capture different types of token — I just simplified it a bit for readability
  
  Cheers
  Marco
  
fathan says:

October 22, 2015 at 6:31 am

Hello Marco, I’ve got error. The message is “NameError: name ‘re’ is not defined”
What does ‘re’ mean in you code here “re.compile.. bla..bla..bla” ?
Do I need to import something?
Thank you

1. Marco says:
  
  October 22, 2015 at 6:38 am
  
  Hi fathan, thanks for spotting this, the import was missing at the beginning of the snippet so I fixed it. The “re” module is part of the standard library and deals with regular expressions: https://docs.python.org/3/library/re.html
  
  Cheers
  Marco
  
Alma says:

November 24, 2015 at 11:22 am

Hello Marco

Thanks so much for this nice presentation and it is really helpful.I though have some problems with running the code specially when I want to read the json file my directory.

with open(‘company.json, ‘r’) as f:
for line in f:
tweet = json.loads(line)
tokens = preprocess(tweet[‘text’])
do_something_else(tokens)

it does not know either line,preprocess and do_something_else although preprocesss and tokens and already defined.Would you please release the whole code fo this section so I can fix my mistakes?

Thanks

1. Marco says:
  
  November 24, 2015 at 7:29 pm
  
  Hi Alma, the do_something_else() function is just a place-holder for you to implement depending on your application needs. You can for example have a look at the third article (Term frequencies) to see how to do some basic frequency analysis.
  Cheers,
  Marco
  
Jigar Mehta says:

December 4, 2015 at 8:31 am

Hi.. I have the tweets text stored in a pandas dataframe in column text. When I use like this: print(preprocess(dataframe[‘text’])).. I get an error…”expected string or buffer”. How do I convert the text column of dataframe into text. Its already into text. Pls help

1. Marco says:
  
  December 4, 2015 at 9:58 am
  Hi Jigar. dataframe[‘text’] is not a string, but rather a list of strings (to be more precise, a pandas.Series which is an array-like object, where each item is a string). So you simply need to iterate through it:
```
for text in dataframe['text']:
    print(preprocess(text))
```
  Cheers,
  Marco
  
Manuel Köck says:

December 4, 2015 at 7:08 pm

Great Article :)

Maybe you could help me out :)
My problem is the JSON i guess when i run the code with

with open(‘mytweets.json, ‘r’) as f:
for line in f:
tweet = json.loads(line)
tokens = preprocess(tweet[‘text’])

I get an error:
File “.\testing.py”, line 50
with open(‘stream_python.json, ‘r’) as f:
^
SyntaxError: EOL while scanning string literal

//Note i have an json file containing several tweets with an row between them
//Note2 I have downloaded twitter_stream_downloader.py and saved some tweets of #facebook in an file then i tried your solutions on Text Pre-processing but i get the error mentioned above?
So could this be a problem from the free row between each tweet in my json file?
NOte3 When i go to an website to “print” my json it says : There are multiple root elements in the stream_python.json file
So the error is caused that my json file is not “good” or?

Thank you for your awesome article

1. Jigar Mehta says:
  
  December 5, 2015 at 1:22 am
  
  with open(‘stream_python.json, ‘r’) as f:….u r missing a close single inverted comma at the end of file name…… ‘stream_python.json’ …it will work!
  
  CHEERS Jigar
  
  1. Marco says:
    
    December 5, 2015 at 12:33 pm
    
    Thanks Jigar, well spotted
    
2. Marco says:
  
  December 5, 2015 at 1:11 pm
  
  Hi, just to add to what Jigar has already mentioned, regarding note 3: the file is not in JSON format, but rather in JSON Lines (http://jsonlines.org/). With JSON Lines, each line is a valid JSON (and there are no empty lines). So if you try to load the whole file, you don’t have a single JSON document, hence the “multiple root elements” error
  
  Cheers,
  Marco
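The difference between a single JSON document and JSON Lines can be reproduced with a two-line string. This is only a sketch with made-up content, where the string stands in for the content of a file.

```python
import json

# Two tweets stored in JSON Lines style: one JSON document per line
content = '{"text": "first tweet"}\n{"text": "second tweet"}\n'

# Parsing the whole content as a single JSON document fails,
# because there is more than one root element
try:
    json.loads(content)
    parsed_whole = True
except ValueError:
    parsed_whole = False

# Parsing line by line works
tweets = [json.loads(line) for line in content.splitlines()]

print(parsed_whole)  # False
print(len(tweets))   # 2
```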
  
Jeppe Gade says:

December 8, 2015 at 10:52 am

Hi. Thanks for this wonderful guide :-) and hello from Denmark.

I have gone through installation of 32-bit Anaconda and I use iPython and I AM very noobish, since I get an error when I input this code:

import operator
import json
from collections import Counter

fname = '26-Oktober-data.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))

TypeError: ‘None-type’ object is not iterable

I have got a json-file named 26-Oktober-data.json with about 14000 tweets that I would like to manipulate, but I keep getting this error. I actually DID manage to get it to work the first time I tried it (Python told me which tokens were used the most), but when I made a repetition an hour later, to learn it again, I got the above error.

What am I missing here? :-)

jiizaa says:

December 8, 2015 at 4:03 pm

Hi Marco;
Thank you very much for this great article, but It seems like I have a problem with this line : ” tweet = json.loads(line)”
This is the error I get:
ValueError: Expecting value: line 2 column 1 (char 1).
What’s wrong? could you please help me?

1. Jeppe Gade says:
  
  December 8, 2015 at 4:29 pm
  
  I’ve got that same issue actually… help would be much appreciated :)
  
2. Jeppe says:
  
  December 9, 2015 at 2:52 pm
  
  I have figured out what caused MY error:
  
  The .json file I used had 3 lines of spaces at the bottom of the file – I deleted the three lines and now I get no error when I parse it :)
  
  1. Marco says:
    
    December 9, 2015 at 5:42 pm
    
    Hi Jeppe, good to know that you got this sorted out. Just to expand on this, the file is meant to be in JSON Lines format (http://jsonlines.org), where each line is supposed to be a proper JSON document
    
    Cheers, Marco
    
  2. yousufmotiwala says:
    
    August 26, 2017 at 4:46 pm
    
    hi Jeppe, i am facing the same problem…need your help in deleting those empty lines…thank you in advance:)
    
jiizaa says:

December 8, 2015 at 4:04 pm

Hi Marco;
Thank you very much for this great article, but It seems like I have a problem with this line : ” tweet = json.loads(line)”
This is the error I get:
ValueError: Expecting value: line 2 column 1 (char 1).
What’s wrong? could you please help me?

1. Marco says:
  
  December 9, 2015 at 5:40 pm
  
  Hi jiizaa, the error is very likely to be caused by an empty line in the file. The file collected (using the streaming method in article 1) should be in JSON Lines format (http://jsonlines.org) where each line is meant to be a proper JSON document. The json.loads() function will raise the same ValueError also in case of invalid string. Solutions: fix the file, or put that line in a try/except block to capture the exception.
  
  Hope this helps,
  Marco
  
  1. Jeppe says:
    
    December 9, 2015 at 5:59 pm
    
    Hi Marco
    
    Thank you so much for replying!
    
    It baffles me how I could get it to work the first time, but now I get this error:
    TypeError: ‘NoneType’ object is not iterable
    
    when executing this code:
    fname = ‘vwdata.json’
    with open(fname, ‘r’) as f:
    count_all = Counter()
    for line in f:
    tweet = json.loads(line)
    tokens = preprocess(tweet[‘text’])
    terms_all = [term for term in preprocess(tweet[‘text’])]
    count_all.update(terms_all)
    print(count_all.most_common(5))
    
    What is wrong with this statement?
    
  2. Marco says:
    
    December 9, 2015 at 6:08 pm
    
    which line is causing the error? The error says you’re trying to iterate over a None value
    
  3. Jeppe says:
    
    December 9, 2015 at 6:26 pm
    
    It’s this line that is causing the problem:
    terms_all = [term for term in preprocess(tweet[‘text’])]
    
    Your help is truly appreciated!
    
  4. Marco says:
    
    December 9, 2015 at 7:59 pm
    
    You need to investigate why the preprocess() function is returning None occasionally. If you call it manually from outside the loop, does it return a list?
    
  5. Jeppe says:
    
    December 9, 2015 at 8:16 pm
    
    I do not know how to call the function from outside the loop – if I write:
    preprocess()
    
    in the terminal, I get this error:
    preprocess() missing 1 required positional argument: ‘s’
    
    What argument do I need to pass for this to work?
    
  6. Marco says:
    
    December 10, 2015 at 6:45 am
    
    It takes a string (e.g. the text of a tweet) and it returns a list of strings (tokens)
    
  7. Jeppe says:
    
    December 10, 2015 at 1:18 pm
    
    Ok, I have now tried to pass a string with the preprocess() function – writing:
    
    preprocess(‘This is a string of text for testing.’)
    
    and it returns… nothing. The command line just accepts the input and gives me a new line for inputting the next command. No list is generated regardless of what string I pass in preprocess()
    
  8. jiizaa says:
    
    December 13, 2015 at 10:01 pm
    
    Thanks Marco, it worked.. I had empty lines in my JSON file. Thanks for your help :)
    
  9. yousufmotiwala says:
    
    August 26, 2017 at 5:06 pm
    
    Hello Marco, First of all thanx for such a helpful blog….
    i am also getting this same error “json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)”
    on this line “tweet = json.loads(line)”
    you said we have to have these files in jason lines format…but i did not get it, and also couldn’t find anything useful on the web…will you please help me out with this….thanx in advance..:)
    
dataanalyticsscience says:

December 17, 2015 at 3:46 am

Thanks for this tutorial. I preprocessed the data in json file. My question is if there is a better way to handle non-English tweets. For example, one tweet is shown as following:

[u'QR', u'\u30b3', u'\u30fc', u'\u30c9', u'\u3092', u'\u7d20', u'\u65e9', u'\u304f', u'\u8a8d', u'\u8a3c', u'\u3057', u'\u3001', u'\u5185', u'\u8535', u'\u30d6', u'\u30e9', u'\u30a6', u'\u30b6', u'\u30fc']

1. Marco says:
  
  December 29, 2015 at 2:44 pm
  
  Hi, sorry for the late reply. Tokenization is specific for a language, or at least for a family of languages. If you’re mostly dealing with non-latin alphabets, you’d better look for a tokenization library which deals with your target language. I don’t have a specific recommendation but there are open source options out there (e.g. pyMMSeg for Chinese, TinySegmenter for Japanese, etc.)
  Cheers, Marco
  
Dave says:

February 2, 2016 at 1:30 am

When I try to load the json file that I pulled from the streaming twitter api i get a
ValueError: Expecting value: line 2 column 1 (char 1)

I guess this is because twitter streaming API json files are newline-delimited, and the blank line is messing up the json.load() method? Is there a simple way around this?

Thanks,

1. Marco says:
  
  February 2, 2016 at 7:31 am
  
  Hi Dave, there’s some discussion about the file format and this particular error in the comments above (look for “ValueError” on this page)
  Cheers,
  Marco
  
  1. Dave says:
    
    February 3, 2016 at 10:51 pm
    
    Hi Marco,
    
    Yeah sorry, I did see that afterwards. Here is some code I wrote that solved this problem for me in case anyone else needs a quick fix.
    
    with open(fname, 'r') as f:
        count_all = Counter()
        for line in f:
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip blank or malformed lines
    
    More specifically though I was wondering why something like this requires a “work around”. I see the value in having a file of json objects as opposed to a json file, but is there a more elegant way to preprocess these files?
    
    Thanks for the help, love your posts.
    
2. Marco says:
  
  February 4, 2016 at 6:43 am
  
  I think in general the original code can be improved to be made more robust, e.g. to handle these little bumps in the first place, but for the sake of clarity/brevity I’ve skipped some try/except here and there in the explanation. These could be added either before processing the file, to ensure it’s in a clean format, or during processing as you’ve done. I’m not sure you can get more elegant than a single try/except block :)
  Cheers,
  Marco
  
wenlei says:

February 3, 2016 at 2:27 am

Hi-, Marco, I am from WPI. it is interesting to see your post. I try to follow your code.
But I run into error at terms_all = [term for term in preprocess(tweet[‘text’])]
the error info says TypeError: list indices must be integers, not str. Would you please take a look where it went wrong? thanks a lot

import operator
import json
from collections import Counter
from nltk.tokenize import word_tokenize
import re

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\]/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]

tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

fname = 'Election.json'
with open(fname, 'r') as f:
    count_all = Counter()
    for line in f:
        tweet = json.loads(line)
        # Create a list with all the terms
        terms_all = [term for term in preprocess(tweet['text'])]
        # Update the counter
        count_all.update(terms_all)
    # Print the first 5 most frequent words
    print(count_all.most_common(5))

wenlei says:

February 3, 2016 at 2:41 am

Hi-, Marco,
I got error at the following. Am I missing something?
terms_all = [term for term in preprocess(tweet[‘text’])]
the error is
list indices must be integers, not str
thanks

1. Marco says:
  
  February 3, 2016 at 7:05 am
  
  Hi wenlei
  check the format of your Election.json file: each line must be a correct json document, no empty lines. You have the error when e.g. the “tweet” variable is not a dictionary as expected
  Cheers,
  Marco
  
wenlei says:

February 3, 2016 at 7:35 pm

thanks a lot, Marco. first, I apologize for post question twice. I did not see the first one get posted.
You are right. my election.json is generated by other twitter package. that could be not go line by line. I have tried your way using canopy which only has tweepy 2.1. It hang there and I did not see file generated. Do you see any work around if I use the existing json?
again thank you very much.

1. Marco says:
  
  February 4, 2016 at 6:23 am
  You could simply wrap the line that is causing the problem in a try/except block, e.g. something similar to:
```
try:
    terms_all = [term for term in preprocess(tweet['text'])]
except TypeError:
    pass
```
  Another option is to clean the json file beforehand
  Cheers,
  Marco
  
Johan says:

February 23, 2016 at 3:14 pm

Hi Marco! A really nice guide, but unfortunately I must have missed something!

I store data with
“class MyListener(StreamListener):

def on_data(self, data):
try:
with open(‘Saved data/’ + filename + ‘.json’, ‘a’) as outfile:
outfile.write(data)
return True
except BaseException as e:
print(‘Error on_data: %s’ % str(e))
return True

def on_error(self, status):
print(status)
return True

if stream:
twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=[tracking_string])”

And tries to preprocesses it with
“with open(‘saved data/’ + filename + ‘.json’, ‘r’) as infile:
for line in infile:
tweet = json.loads(line)
tokens = preprocess(tweet[‘text’])
print(‘mark’)
”

My json file has more than one line but it always stops after the first loop (it prints “mark” one time but then gives the error mentioned above with error @ “tweet = json.loads(line)” and “Expecting value: line 2 column 1 (char 1)”. I guess that the code doesn’t get the second line but I’ve opened it to check the format and it seems right (no blank lines etc.). Any idéa what can be wrong?

Thanks in advance

Joy Tagle (@joytagle) says:

March 6, 2016 at 5:15 am

Hi! Your blog is very helpful to newbies like me :) I’ve got a problem though. Cause I streamed some data from tweeter (which I learned from your Part1) problem or difference is I saved it in a txt file. Now I’m trying to do this part 2 and it’s giving me this error: line 40, in
tokens = preprocess(tweet[‘text’])
KeyError: ‘text’
Should I be replacing the text with a different thing? Or is it because I saved my tweets in a txt file? Hope to get a help from you :)
It’s really good to know there’s a lot of help like this around.

1. Jeppe says:
  
  March 6, 2016 at 12:08 pm
  
  Hi Joy Tagle – I would like to answer that
  
  You are on to something here… Because you have to save your data in a .json format and not text.
  
  If the syntax in the file is correct, you can simply rename the extension (.txt) to .json – that Should do the trick, and afterwards you should be able to “preprocess” it… If not: please don’t hesitate to write here again.
  
  Best regards,
  Jeppe
  
2. Marco says:
  
  March 7, 2016 at 8:58 am
  
  Hi Joy,
  the file extension per se doesn’t really matter. The important aspect is the format of the data: each line has to be a proper JSON document, one document per line, no empty lines. Also, make sure you’ve closed the stream so you’re not writing on the file while you attempt to read it. You can also print the tweet before calling preprocess if it helps debugging.
  Cheers,
  Marco
  
  1. fanilaYounis says:
    
    March 29, 2016 at 5:50 pm
    
    Hey macro what if i want to apply all this on the data saved on database rather than json format.
    kindly reply
    
MAYUR HAMPIHOLI says:

March 21, 2016 at 5:23 pm

Thanks Marco for an amazing document. I can help clear the issues some people have as I have faced them myself. If you have produced .json files from Tweepy or any other twitter streaming api just make sure that your .json file has continuous lists as objects. Simply, get rid of any newline(or whitespace) that occurs between these tweets and you’re good to go with the code. The crux of the problem is understanding what json.dumps does and understanding that the .json file needs to be without whitespace or newlines.
You could refer to this->
https://docs.python.org/3/library/json.html
http://jsonlines.org/

Kel3vra says:

May 12, 2016 at 6:21 pm

is there any ideas how to clean my data from those “\u0435”
[(u'#news', 1068), (u'\u043e', 798), (u'\u0438', 623), (u'\u0430', 613), (u'\u0435', 593), (u'\u043d', 519), (u'#News', 464), (u'\u0441', 451), (u'\u0442', 443), (u'\u0432', 430), (u'\u0440', 400), (u'\u3057', 372), (u'\u30fc', 355), (u'\u3044', 344), (u'RT', 302)]

1. MAYUR HAMPIHOLI says:
  
  May 14, 2016 at 7:15 am
  
  I’m not quite sure what you mean, Kel3vra. But if you want to remove non-ASCII characters from within the tweet itself, you can use a regex:
  temp_str = tweet['text']
  # regex to remove non-ASCII characters
  temp_str = re.sub(r'[^\x00-\x7F]+', '', temp_str)
  
Manu Sharma says:

June 22, 2016 at 12:44 pm

Hi Marco – Its really helpful, Just 1 question from my side, Do you think instead of tokenizer

tweet.split( ‘ ‘) can be a better option, bec I see its almost working as re expression

1. Marco says:
  
  June 25, 2016 at 9:49 am
  
  Hi Manu, using split() over whitespaces works only for toy examples. As soon as you have some punctuation, you’re going to miss it. You can also look into nltk.tokenize.TweetTokenizer if you don’t need any particular customisation, it works pretty well
  
  Cheers,
  Marco
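A quick way to see the difference is to compare split() with a small regex on a short string. The pattern below is a deliberately reduced illustration (emoticons, words, anything else), not the full tokeniser from the article.

```python
import re

text = 'just an example! :D'

# Splitting on whitespace leaves punctuation glued to the word
split_tokens = text.split()
print(split_tokens)  # ['just', 'an', 'example!', ':D']

# Reduced pattern: emoticons first, then words, then any other character
simple_re = re.compile(r'[:=;][oO\-]?[D\)\]\(/\\OpP]|\w+|\S')
regex_tokens = simple_re.findall(text)
print(regex_tokens)  # ['just', 'an', 'example', '!', ':D']
```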
  
Rekha says:

June 30, 2016 at 8:18 pm

Hi Marco,
Thank you very much for this enormously useful step-by-step guide. I have a question about the preprocess function – is this supposed to convert all tokens other than emoticons to lower case? If that is the case, why are ‘RT’ and ‘#NLP still in upper case? I see the same thing while implementing this code with my file.

print(preprocess(tweet))
# ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', 'http://example.com', '#NLP']

Thanks!
Rekha

1. Marco says:
  
  July 1, 2016 at 5:20 am
  
  Hi Rekha, if you see the definition of the preprocess() function, the lowercasing is optional and it defaults to False, e.g. try preprocess(tweet, lowercase=True)
  Cheers,
  Marco
  
Shiv Golani says:

July 7, 2016 at 6:02 am

Thank you for the Awesome Tutorial.

Vladka says:

July 21, 2016 at 11:49 am

Hello! You can use twitter tokenizer in nltk package.
It can proceed smiles, hashtags and e.t.c.
TweetTokenizer
http://www.nltk.org/api/nltk.tokenize.html

Shwet says:

July 21, 2016 at 7:01 pm

Hello Marc,
Your post is very helpful but I’m stuck at some error, Please help me fix it.
This is url of my code:
https://github.com/shwetkm/IndvsWI-Twitter-Miner/blob/master/tokenizer.py

The error shown is:

raise JSONDecodeError(“Expecting value”, s, err.value) from None

JSONDecodeError: Expecting value

1. Marco says:
  
  July 22, 2016 at 5:54 am
  Hi Shwet, the problem is probably given by the json file, please have a look at the comments above (you might have for example empty lines in your file). Another possible source of problem is the new line character used in Windows, so you might need to open your file with:
```
with open(fname, 'r', newline='\r\n') as f:
```
  Cheers,
  Marco
  
  1. Shwet says:
    
    July 22, 2016 at 2:42 pm
    
    Thanks Marco
    Problem is solved :)
    
Alex says:

November 9, 2016 at 11:53 pm

Hi Marco,

Thank you for the great tutorial. I have a question about using twitter data. When I test preprocess() function with a normal text I type, it works great as you did. However, when I apply it to my twitter data, there is an extra ” u’ ” in front of every words. Is it some mechanism or rule for twitter data or something went wrong?

1. Marco says:
  
  November 10, 2016 at 7:27 am
  
  Hi Alex, the ‘u’ in front of strings stands for Unicode in Python 2. These examples have been tested in Python 3 so there might be some hiccup here and there if you run them on Python 2. Anyway this is one of the differences between Python 2 and Python 3: in Py2, you have the data type str which holds ASCII strings and a separate unicode data type, while in Py3 the data types have been unified so there’s a str that holds unicode text. There’s a lot of documentation online about this topic in case you stumble on some encoding/decoding issue.
  
  Cheers,
  Marco
  
lpulsc-obetf says:

February 26, 2017 at 12:42 am

Hi Marco,

Thank you so much for this info.

I’m currently doing text pre-processing of a Facebook dataset, mostly comments.
How can I use this code for Facebook data?

Thank you,

Best cheers,
Isaac

1. Marco says:
  
  February 28, 2017 at 4:40 pm
  
  Hi
  the concepts can be reused for Facebook data quite easily. You can check out some of the Facebook examples from my book: https://github.com/bonzanini/Book-SocialMediaMiningPython/tree/master/Chap04
  
  Cheers
  Marco
  
mochadwi says:

March 11, 2017 at 8:58 am

Hello, thanks for providing this tutorial for us @Marco

I want to know, is there any limitation on scraping this?

I’ve tried multiple runs:

#1 terminal instances
python parse.py -q fashion -d data

#2 terminal instances
python parse.py -q online -d data

The second terminal instance shows “420” instead. Is it a limitation from Twitter Streams?

Marco says:

March 15, 2017 at 8:20 am

Hi,
yes the limitations are documented here: https://dev.twitter.com/rest/public/rate-limiting
Error 420 happens when you exceed these limitations.

Cheers,
Marco

1. mochadwi says:
  
  May 9, 2017 at 9:18 am
  
  Ah, that’s more like it!
  
  Is there any way to work around this limitation?
  
  1. Marco says:
    
    May 12, 2017 at 2:40 pm
    
    Not as far as I know; these limitations are imposed by Twitter.
    
Inhyeok Yoo says:

April 27, 2017 at 11:49 am

Hi.
Thanks for the nice post. It helps me a lot.
However, I have a question.
Can I know the reason you use word_tokenize instead of TweetTokenizer?

1. Marco says:
  
  May 9, 2017 at 7:48 am
  
  Hi
  the word_tokenize example is just to showcase that the “general purpose” tokenizer doesn’t work well with Twitter data. To perform tokenization we use the regex approach which is very similar to what the TweetTokenizer does (you can of course go straight to the TweetTokenizer, this is just to explain what happens under the hood)
  
  Cheers,
  Marco
  
Kartikey Kant says:

June 25, 2017 at 9:37 am

Hey Marco, thanks for the great series of articles. However, I’ve been facing an error at the json.loads line in my code.
I get ValueError: Extra data: line 1 column 3115 - line 2 column 1 (char 3114 - 301245) on the line tweet = json.loads(line)
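
An "Extra data" error typically means that json.loads found more than one JSON document in the string it was given, for example several tweets written back to back without a newline between them. A hedged sketch of one way to cope, using the standard library's `json.JSONDecoder.raw_decode` to read the objects one after another (`iter_json_objects` is a hypothetical helper name, and the sample string is made up):

```python
import json

def iter_json_objects(text):
    """Yield every JSON document found back to back in a single string."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not accept leading whitespace, so skip it manually
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)  # returns (object, end index)
        yield obj

# Two tweets glued together on one line, the situation that triggers "Extra data"
blob = '{"id": 1, "text": "first"}{"id": 2, "text": "second"}\n'
tweets = list(iter_json_objects(blob))
```

If the file really contains one tweet per line, reading it line by line as in the article remains the simpler option; this sketch only helps when the line boundaries cannot be trusted.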

wirawanrizkika says:

September 10, 2017 at 2:39 am

Hi Marco,
I would like to ask: what if my streamed Twitter data is in .txt format? How do I use the .txt file for pre-processing?

Charles Ekaluo says:

October 2, 2017 at 7:47 pm

with open('mytweets.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        do_something_else(tokens)

* I keep getting an error which says preprocess is not defined! Help!

* and what is supposed to be in the [text] list?

1. Marco says:
  
  October 3, 2017 at 7:51 am
  
  Hi Charles,
  the function preprocess is defined right above the snippet you’re quoting. The function do_something_else is just a placeholder for additional steps (if you don’t have any, you can remove it).
  The variable tweet is a dictionary, and tweet['text'] is a string with the content of the tweet itself. The structure of the dictionary is described in the section “The Anatomy of a Tweet”.
  
  Cheers,
  Marco
  
rajidhamon says:

November 24, 2017 at 5:17 am

Hi!! How can I do the same process if I have the tweets in a CSV file?

Rob says:

November 28, 2017 at 12:22 pm

Hi Marco,
great tutorial!
Now that Twitter is working with 280 characters, the JSON file has a new structure. If the tweet is longer than 140 characters, 'text' will only contain the first 140. The full tweet is then available in ['extended_tweet']['full_text']. I guess that has something to do with how tweepy operates; maybe they will change that soon. But for now I think there needs to be an additional check to see whether ['extended_tweet']['full_text'] exists and, if not, just use the default 'text'.

I’m not sure where my mistake is right now, but it’s not working for me so far. Any suggestions?

for line in f:
    tweet = json.loads(line)
    extended_text = tweet.get('extended_text')
    if extended_text:
        full_text = extended_text.get('full_text')
        if full_text:
            pass
        else:
            text = tweet.get('text')

    tokens = preprocess(tweet)
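
Two details in the snippet above look like likely culprits: it looks up 'extended_text' while the key described in the comment is 'extended_tweet', and it passes the whole tweet dictionary to preprocess() rather than the text string. A minimal sketch of the fallback, assuming the structure described above (`get_tweet_text` is a hypothetical helper, not part of the tutorial, and the sample tweets are made up):

```python
import json

def get_tweet_text(tweet):
    """Return the full text if the tweet has an 'extended_tweet' block, else 'text'."""
    extended = tweet.get('extended_tweet') or {}
    # Fall back to the regular 'text' field when no full text is available
    return extended.get('full_text') or tweet.get('text', '')

short = json.loads('{"text": "short tweet"}')
long_ = json.loads(
    '{"text": "first part only", '
    '"extended_tweet": {"full_text": "the whole long tweet"}}'
)
short_text = get_tweet_text(short)
long_text = get_tweet_text(long_)
```

Inside the loop this would be used as `tokens = preprocess(get_tweet_text(tweet))`, so that preprocess() always receives a string.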

Kitwradr says:

January 6, 2018 at 2:52 pm

Can someone explain what regex_str and emoticons_str are, and in what format they are declared?
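
Going by the regex-based tokenizer discussed in the comments above, these names hold regular-expression patterns written as Python strings: one pattern for emoticons, and a list of patterns that are joined with | and compiled with the re module. A reduced sketch of the idea follows; the patterns here are shortened assumptions for illustration, not the article's exact list.

```python
import re

# Pattern for simple emoticons such as :) ;-) :D (simplified)
emoticons_str = r"(?:[:=;][oO\-]?[D\)\]\(\]/\\OpP])"

# Ordered list of patterns; earlier entries win when several could match
regex_str = [
    emoticons_str,
    r"(?:@[\w_]+)",                   # @-mentions
    r"(?:\#+[\w_]+[\w'_\-]*[\w_]+)",  # hashtags
    r"http[s]?://\S+",                # URLs (simplified)
    r"(?:[\w_]+)",                    # other words
    r"(?:\S)",                        # any other non-space character
]

# One big alternation wrapped in a single capturing group
tokens_re = re.compile("(" + "|".join(regex_str) + ")", re.IGNORECASE)

def tokenize(s):
    """Return the list of tokens found in s, in order."""
    return tokens_re.findall(s)

tokens = tokenize("RT @marcobonzanini: just an example! :D http://example.com #NLP")
# tokens == ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!',
#            ':D', 'http://example.com', '#NLP']
```

Keeping the emoticon, mention, hashtag and URL patterns ahead of the generic word pattern is what stops them from being split into separate punctuation and word tokens.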

Shobhit says:

January 13, 2018 at 1:02 pm

I’m not able to open the JSON file with the tweets if it contains more than one tweet.

Tracy says:

January 19, 2018 at 1:42 pm

Hi Marco, thank you very much for your great tutorial. It would be great if I could get some advice from you.

I have used your code and classified the collected tweets into positive and negative tweets. I have tried to produce a data visualisation of a keyword from the positive tweets, but I ran into some problems.

Can you help me, please? Thanks
