Data Mining · NLP · Python

Mining Twitter Data with Python (Part 2: Text Pre-processing)

This is the second part of a series of articles about data mining on Twitter. In the previous episode, we saw how to collect data from Twitter. In this post, we’ll discuss the structure of a tweet and start digging into the processing steps needed for text analysis.

The Anatomy of a Tweet

Assuming that you have collected a number of tweets and stored them in JSON as suggested in the previous article, let’s have a look at the structure of a tweet:

import json

with open('mytweets.json', 'r') as f:
    line = f.readline() # read only the first tweet/line
    tweet = json.loads(line) # load it as Python dict
    print(json.dumps(tweet, indent=4)) # pretty-print

The key attributes are the following:

  • text: the text of the tweet itself
  • created_at: the date of creation
  • favorite_count, retweet_count: the number of favourites and retweets
  • favorited, retweeted: booleans stating whether the authenticated user (you) has favourited or retweeted this tweet
  • lang: code for the language (e.g. “en” for English)
  • id: the tweet identifier
  • place, coordinates, geo: geo-location information if available
  • user: the author’s full profile
  • entities: list of entities like URLs, @-mentions, hashtags and symbols
  • in_reply_to_user_id: user identifier if the tweet is a reply to a specific user
  • in_reply_to_status_id: status identifier if the tweet is a reply to a specific status

As you can see, there’s a lot of information we can play with. All the *_id fields also have a *_id_str counterpart, where the same information is stored as a string rather than a big int (to avoid overflow problems). These data already allow for some interesting analysis: we can check who is most favourited/retweeted, who is talking to whom, which hashtags are the most popular, and so on. Most of the goodness we’re looking for, i.e. the content of a tweet, is embedded in the text, and that’s where we start our analysis.
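As a rough illustration of how these attributes can be read, the sketch below uses a small hand-made dictionary that mimics a few of the fields listed above. The values are invented, a real tweet carries many more keys, and the helper name summarise_tweet is hypothetical.

```python
# A hand-made, heavily trimmed stand-in for a tweet dictionary.
# All values are invented for illustration only.
sample_tweet = {
    'id': 1234567890,
    'id_str': '1234567890',
    'text': 'RT @marcobonzanini: just an example! :D http://example.com #NLP',
    'lang': 'en',
    'retweet_count': 3,
    'favorite_count': 5,
    'user': {'screen_name': 'some_user'},
    'entities': {
        'hashtags': [{'text': 'NLP'}],
        'user_mentions': [{'screen_name': 'marcobonzanini'}],
        'urls': [{'url': 'http://example.com'}],
    },
}


def summarise_tweet(tweet):
    """Collect a few fields of interest, tolerating missing keys."""
    entities = tweet.get('entities', {})
    return {
        'id': tweet.get('id_str'),
        'author': tweet.get('user', {}).get('screen_name'),
        'hashtags': [h['text'] for h in entities.get('hashtags', [])],
        'mentions': [m['screen_name'] for m in entities.get('user_mentions', [])],
        'retweets': tweet.get('retweet_count', 0),
    }


summary = summarise_tweet(sample_tweet)
print(summary)
```

Using .get() with defaults is a design choice for this sketch: it keeps the helper from failing on records that lack some of the fields.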

We start our analysis by breaking the text down into words. Tokenisation is one of the most basic, yet most important, steps in text analysis. The purpose of tokenisation is to split a stream of text into smaller units called tokens, usually words or phrases. While this is a well understood problem with several out-of-the-box solutions from popular libraries, Twitter data pose some challenges because of the nature of the language.

How to Tokenise a Tweet Text

Let’s see an example, using the popular NLTK library to tokenise a fictitious tweet:

from nltk.tokenize import word_tokenize

tweet = 'RT @marcobonzanini: just an example! :D http://example.com #NLP'
print(word_tokenize(tweet))
# ['RT', '@', 'marcobonzanini', ':', 'just', 'an', 'example', '!', ':', 'D', 'http', ':', '//example.com', '#', 'NLP']

You will notice some peculiarities that are not captured by a general-purpose English tokeniser like the one from NLTK: @-mentions, emoticons, URLs and #hash-tags are not recognised as single tokens. The following code proposes a pre-processing chain that takes these aspects of the language into account.

import re

emoticons_str = r"""
    (?:
        [:=;] # Eyes
        [oO\-]? # Nose (optional)
        [D\)\]\(\[/\\OpP] # Mouth
    )"""

regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs

    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
   
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

def tokenize(s):
    return tokens_re.findall(s)

def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens

tweet = 'RT @marcobonzanini: just an example! :D http://example.com #NLP'
print(preprocess(tweet))
# ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', 'http://example.com', '#NLP']

As you can see, @-mentions, emoticons, URLs and #hash-tags are now preserved as individual tokens.

To process all our tweets, previously saved to a file:

with open('mytweets.json', 'r') as f:
    for line in f:
        tweet = json.loads(line)
        tokens = preprocess(tweet['text'])
        do_something_else(tokens) # placeholder: replace with your own processing
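The loop above assumes that every line of the file is a valid JSON document containing a text field. A more defensive variant, sketched below under the assumption that the file may also contain blank lines, invalid lines, or records without text, skips anything it cannot use. The name iter_tweets is hypothetical, and the file content is simulated with io.StringIO so that the snippet runs on its own.

```python
import io
import json


def iter_tweets(lines):
    """Yield one dict per valid, non-empty JSON line that has a 'text' key."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank line between records
        try:
            record = json.loads(line)
        except ValueError:
            continue  # not a valid JSON document
        if isinstance(record, dict) and 'text' in record:
            yield record


# Simulated file content; with real data, pass the open file object instead.
fake_file = io.StringIO(
    '{"text": "first tweet"}\n'
    '\n'
    'not json at all\n'
    '{"other": "record without text"}\n'
    '{"text": "second tweet"}\n'
)
texts = [tweet['text'] for tweet in iter_tweets(fake_file)]
print(texts)
```

Silently skipping bad lines is convenient for exploration; for a real pipeline it may be preferable to count or log them instead.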

The tokeniser is far from perfect, but it gives you the general idea. The tokenisation is based on regular expressions (regexp), which is a common choice for this type of problem. Some particular types of tokens (e.g. phone numbers or chemical names) will not be captured, and will probably be broken into several tokens. To overcome this problem, as well as to improve the richness of your pre-processing pipeline, you can improve the regular expressions, or even employ more sophisticated techniques like Named Entity Recognition.

The core component of the tokeniser is the regex_str variable, which is a list of possible patterns. In particular, we try to capture some emoticons, HTML tags, Twitter @usernames (@-mentions), Twitter #hashtags, URLs, numbers, words with and without dashes and apostrophes, and finally “anything else”. Please take a moment to observe the regexp for capturing numbers: why don’t we just use \d+? The problem here is that numbers can appear in several different ways, e.g. 1000 can also be written as 1,000 or 1,000.00 — and we can get into more complications in a multi-lingual environment where commas and dots are inverted: “one thousand” can be written as 1.000 or 1.000,00 in many non-anglophone countries. The task of identifying numeric tokens correctly just gives you a glimpse of how difficult tokenisation can be.
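To make the difference concrete, the short sketch below compares a plain \d+ with the numbers pattern taken from regex_str on a made-up string. The last call shows that the pattern still splits the inverted notation 1.000,00, which is one of the limitations mentioned above.

```python
import re

simple_re = re.compile(r'\d+')
# Numbers pattern copied from regex_str
number_re = re.compile(r'(?:(?:\d+,?)+(?:\.?\d+)?)')

text = 'totals: 1000 1,000 1,000.00'

simple_matches = simple_re.findall(text)
number_matches = number_re.findall(text)
european_matches = number_re.findall('1.000,00')

print(simple_matches)    # ['1000', '1', '000', '1', '000', '00']
print(number_matches)    # ['1000', '1,000', '1,000.00']
print(european_matches)  # ['1.000', '00']
```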

The regular expressions are compiled with the flags re.VERBOSE, to allow spaces in the regexp to be ignored (see the multi-line emoticons regexp), and re.IGNORECASE, to match both upper and lower case. The tokenize() function simply catches all the tokens in a string and returns them as a list. This function is used within preprocess(), which acts as a pre-processing chain: in this case we simply add a lowercasing feature for all the tokens that are not emoticons (e.g. :D doesn’t become :d).
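The lowercasing step can be seen in isolation in the sketch below. It assumes the tokens have already been produced by the tokeniser, and it uses a compact single-line emoticon pattern modelled on emoticons_str rather than the exact multi-line version.

```python
import re

# Compact, single-line emoticon pattern modelled on emoticons_str:
# eyes, optional nose, mouth.
emoticon_re = re.compile(r'^(?:[:=;][oO\-]?[D\)\]\(/\\OpP])$', re.IGNORECASE)

# Tokens as returned by preprocess() for the example tweet
tokens = ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!',
          ':D', 'http://example.com', '#NLP']

# Lowercase everything except emoticons, as preprocess(s, lowercase=True) does
lowered = [t if emoticon_re.search(t) else t.lower() for t in tokens]
print(lowered)
# ['rt', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', 'http://example.com', '#nlp']
```

Note that lowercasing is off by default, so preprocess(tweet) leaves RT and #NLP untouched unless lowercase=True is passed.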

Summary

In this article we have analysed the overall structure of a tweet, and we have discussed how to pre-process the text before we can get into some more interesting analysis. In particular, we have seen how tokenisation, despite being a well-understood problem, can get tricky with Twitter data. The proposed solution is far from perfect but it’s a good starting point, and fairly easy to extend.

@MarcoBonzanini

76 thoughts on “Mining Twitter Data with Python (Part 2: Text Pre-processing)”

  1. At first, thank you for this very useful article. I am a student of University of Tsukuba, Japan, and I’m doing the research about Twitter for analyzing the features of celebrity users’ tweets. I think your article could help a lot, but I cannot see all the codes, just few lines. Could you tell me how to see all the codes? Thanks again.

    1. Thanks Raju, that’s correct, the approach is also very similar, with a long list of regex to capture different types of token — I just simplified it a bit for readability

      Cheers
      Marco

  2. Hello Marco, I’ve got error. The message is “NameError: name ‘re’ is not defined”
    What does ‘re’ mean in you code here “re.compile.. bla..bla..bla” ?
    Do I need to import something?
    Thank you

  3. Hello Marco

    Thanks so much for this nice presentation and it is really helpful.I though have some problems with running the code specially when I want to read the json file my directory.

    with open(‘company.json, ‘r’) as f:
    for line in f:
    tweet = json.loads(line)
    tokens = preprocess(tweet[‘text’])
    do_something_else(tokens)

    it does not know either line,preprocess and do_something_else although preprocesss and tokens and already defined.Would you please release the whole code fo this section so I can fix my mistakes?

    Thanks

    1. Hi Alma, the do_something_else() function is just a place-holder for you to implement depending on your application needs. You can for example have a look at the third article (Term frequencies) to see how to do some basic frequency analysis.
      Cheers,
      Marco

  4. Hi.. I have the tweets text stored in a pandas dataframe in column text. When I use like this: print(preprocess(dataframe[‘text’])).. I get an error…”expected string or buffer”. How do I convert the text column of dataframe into text. Its already into text. Pls help

    1. Hi Jigar. dataframe[‘text’] is not a string, but rather a list of strings (to be more precise, a pandas.Series which is an array-like object, where each item is a string). So you simply need to iterate through it:

      for text in dataframe['text']:
          print(preprocess(text))
      

      Cheers,
      Marco

  5. Great Article :)

    Maybe you could help me out :)
    My problem is the JSON i guess when i run the code with

    with open(‘mytweets.json, ‘r’) as f:
    for line in f:
    tweet = json.loads(line)
    tokens = preprocess(tweet[‘text’])

    I get an error:
    File “.\testing.py”, line 50
    with open(‘stream_python.json, ‘r’) as f:
    ^
    SyntaxError: EOL while scanning string literal

    //Note i have an json file containing several tweets with an row between them
    //Note2 I have downloaded twitter_stream_downloader.py and saved some tweets of #facebook in an file then i tried your solutions on Text Pre-processing but i get the error mentioned above?
    So could this be a problem from the free row between each tweet in my json file?
    NOte3 When i go to an website to “print” my json it says : There are multiple root elements in the stream_python.json file
    So the error is caused that my json file is not “good” or?

    Thank you for your awesome article

    1. with open(‘stream_python.json, ‘r’) as f: … you are missing the closing single quote at the end of the file name: ‘stream_python.json’. With that fixed, it will work!

      CHEERS Jigar

    2. Hi, just to add to what Jigar has already mentioned, regarding note 3: the file is not in JSON format, but rather in JSON Lines (http://jsonlines.org/). With JSON Lines, each line is a valid JSON (and there are no empty lines). So if you try to load the whole file, you don’t have a single JSON document, hence the “multiple root elements” error

      Cheers,
      Marco

  6. Hi. Thanks for this wonderful guide :-) and hello from Denmark.

    I have gone through installation of 32-bit Anaconda and I use iPython and I AM very noobish, since I get an error when I input this code:

    import operator
    import json
    from collections import Counter

    fname = '26-Oktober-data.json'
    with open(fname, 'r') as f:
        count_all = Counter()
        for line in f:
            tweet = json.loads(line)
            # Create a list with all the terms
            terms_all = [term for term in preprocess(tweet['text'])]
            # Update the counter
            count_all.update(terms_all)
        # Print the first 5 most frequent words
        print(count_all.most_common(5))

    TypeError: ‘None-type’ object is not iterable

    I have got a json-file named 26-Oktober-data.json with about 14000 tweets that I would like to manipulate, but I keep getting this error. I actually DID manage to get it to work the first time I tried it (Python told me which tokens were used the most), but when I made a repetition an hour later, to learn it again, I got the above error.

    What am I missing here? :-)

  7. Hi Marco;
    Thank you very much for this great article, but It seems like I have a problem with this line : ” tweet = json.loads(line)”
    This is the error I get:
    ValueError: Expecting value: line 2 column 1 (char 1).
    What’s wrong? could you please help me?

    1. I have figured out what caused MY error:

      The .json file I used had 3 lines of spaces at the bottom of the file – I deleted the three lines and now I get no error when I parse it :)

      1. Hi Jeppe, good to know that you got this sorted out. Just to expand on this, the file is meant to be in JSON Lines format (http://jsonlines.org), where each line is supposed to be a proper JSON document

        Cheers, Marco

  8. Hi Marco;
    Thank you very much for this great article, but It seems like I have a problem with this line : ” tweet = json.loads(line)”
    This is the error I get:
    ValueError: Expecting value: line 2 column 1 (char 1).
    What’s wrong? could you please help me?

    1. Hi jiizaa, the error is very likely to be caused by an empty line in the file. The file collected (using the streaming method in article 1) should be in JSON Lines format (http://jsonlines.org) where each line is meant to be a proper JSON document. The json.loads() function will raise the same ValueError also in case of invalid string. Solutions: fix the file, or put that line in a try/except block to capture the exception.

      Hope this helps,
      Marco

      1. Hi Marco

        Thank you so much for replying!

        It baffles me how I could get it to work the first time, but now I get this error:
        TypeError: ‘NoneType’ object is not iterable

        when executing this code:
        fname = 'vwdata.json'
        with open(fname, 'r') as f:
            count_all = Counter()
            for line in f:
                tweet = json.loads(line)
                tokens = preprocess(tweet['text'])
                terms_all = [term for term in preprocess(tweet['text'])]
                count_all.update(terms_all)
            print(count_all.most_common(5))

        What is wrong with this statement?

      2. It’s this line that is causing the problem:
        terms_all = [term for term in preprocess(tweet['text'])]

        Your help is truly appreciated!

      3. You need to investigate why the preprocess() function is occasionally returning None. If you call it manually from outside the loop, does it return a list?

      4. I do not know how to call the function from outside the loop – if I write:
        preprocess()

        in the terminal, I get this error:
        preprocess() missing 1 required positional argument: ‘s’

        What argument do I need to pass for this to work?

      5. Ok, I have now tried to pass a string with the preprocess() function – writing:

        preprocess('This is a string of text for testing.')

        and it returns… nothing. The command line just accepts the input and gives me a new line for inputting the next command. No list is generated regardless of what string I pass in preprocess()

      6. Hello Marco, First of all thanx for such a helpful blog….
        i am also getting this same error “json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)”
        on this line “tweet = json.loads(line)”
        you said we have to have these files in jason lines format…but i did not get it, and also couldn’t find anything useful on the web…will you please help me out with this….thanx in advance..:)

  9. Thanks for this tutorial. I preprocessed the data in json file. My question is if there is a better way to handle non-English tweets. For example, one tweet is shown as following:

    [u'QR', u'\u30b3', u'\u30fc', u'\u30c9', u'\u3092', u'\u7d20', u'\u65e9', u'\u304f', u'\u8a8d', u'\u8a3c', u'\u3057', u'\u3001', u'\u5185', u'\u8535', u'\u30d6', u'\u30e9', u'\u30a6', u'\u30b6', u'\u30fc']

    1. Hi, sorry for the late reply. Tokenization is specific for a language, or at least for a family of languages. If you’re mostly dealing with non-latin alphabets, you’d better look for a tokenization library which deals with your target language. I don’t have a specific recommendation but there are open source options out there (e.g. pyMMSeg for Chinese, TinySegmenter for Japanese, etc.)
      Cheers, Marco

  10. When I try to load the json file that I pulled from the streaming twitter api i get a
    ValueError: Expecting value: line 2 column 1 (char 1)

    I guess this is because twitter streaming API json files are newline-delimited, and the blank line is messing up the json.load() method? Is there a simple way around this?

    Thanks,

    1. Hi Dave, there’s some discussion about the file format and this particular error in the comments above (look for “ValueError” on this page)
      Cheers,
      Marco

      1. Hi Marco,

        Yeah sorry, I did see that afterwards. Here is some code I wrote that solved this problem for me in case anyone else needs a quick fix.

        with open(fname, 'r') as f:
            count_all = Counter()
            for line in f:
                try:
                    tweet = json.loads(line)
                except ValueError:
                    pass

        More specifically though I was wondering why something like this requires a “work around”. I see the value in having a file of json objects as opposed to a json file, but is there a more elegant way to preprocess these files?

        Thanks for the help, love your posts.

    2. I think in general the original code can be improved to be made more robust, e.g. to handle these little bumps in the first place, but for the sake of clarity/brevity I’ve skipped some try/except here and there in the explanation. These could be added either before processing the file, to ensure it’s in a clean format, or during processing as you’ve done. I’m not sure you can get more elegant than a single try/except block :)
      Cheers,
      Marco

  11. Hi-, Marco, I am from WPI. it is interesting to see your post. I try to follow your code.
    But I run into error at terms_all = [term for term in preprocess(tweet[‘text’])]
    the error info says TypeError: list indices must be integers, not str. Would you please take a look where it went wrong? thanks a lot

    import operator
    import json
    from collections import Counter
    from nltk.tokenize import word_tokenize
    import re

    emoticons_str = r"""
        (?:
            [:=;] # Eyes
            [oO\-]? # Nose (optional)
            [D\)\]\(\]/\\OpP] # Mouth
        )"""

    regex_str = [
        emoticons_str,
        r'<[^>]+>', # HTML tags
        r'(?:@[\w_]+)', # @-mentions
        r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
        r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs

        r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
        r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
        r'(?:[\w_]+)', # other words
        r'(?:\S)' # anything else
    ]

    tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
    emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

    def tokenize(s):
        return tokens_re.findall(s)

    def preprocess(s, lowercase=False):
        tokens = tokenize(s)
        if lowercase:
            tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
        return tokens

    fname = 'Election.json'
    with open(fname, 'r') as f:
        count_all = Counter()
        for line in f:
            tweet = json.loads(line)
            # Create a list with all the terms
            terms_all = [term for term in preprocess(tweet['text'])]
            # Update the counter
            count_all.update(terms_all)
        # Print the first 5 most frequent words
        print(count_all.most_common(5))

  12. Hi-, Marco,
    I got error at the following. Am I missing something?
    terms_all = [term for term in preprocess(tweet[‘text’])]
    the error is
    list indices must be integers, not str
    thanks

    1. Hi wenlei
      check the format of your Election.json file: each line must be a correct json document, no empty lines. You have the error when e.g. the “tweet” variable is not a dictionary as expected
      Cheers,
      Marco

  13. thanks a lot, Marco. first, I apologize for post question twice. I did not see the first one get posted.
    You are right. my election.json is generated by other twitter package. that could be not go line by line. I have tried your way using canopy which only has tweepy 2.1. It hang there and I did not see file generated. Do you see any work around if I use the existing json?
    again thank you very much.

    1. You could simply wrap the line that is causing the problem in a try/except block, e.g. something similar to:

      try:
          terms_all = [term for term in preprocess(tweet['text'])]
      except TypeError:
          pass
      

      Another option is to clean the json file beforehand
      Cheers,
      Marco

  14. Hi Marco! A really nice guide, but unfortunately I must have missed something!

    I store data with
    class MyListener(StreamListener):

        def on_data(self, data):
            try:
                with open('Saved data/' + filename + '.json', 'a') as outfile:
                    outfile.write(data)
                    return True
            except BaseException as e:
                print('Error on_data: %s' % str(e))
            return True

        def on_error(self, status):
            print(status)
            return True

    if stream:
        twitter_stream = Stream(auth, MyListener())
        twitter_stream.filter(track=[tracking_string])

    And I try to preprocess it with

    with open('saved data/' + filename + '.json', 'r') as infile:
        for line in infile:
            tweet = json.loads(line)
            tokens = preprocess(tweet['text'])
            print('mark')

    My json file has more than one line but it always stops after the first loop (it prints “mark” one time but then gives the error mentioned above with error @ “tweet = json.loads(line)” and “Expecting value: line 2 column 1 (char 1)”. I guess that the code doesn’t get the second line but I’ve opened it to check the format and it seems right (no blank lines etc.). Any idéa what can be wrong?

    Thanks in advance

  15. Hi! Your blog is very helpful to newbies like me :) I’ve got a problem though. Cause I streamed some data from tweeter (which I learned from your Part1) problem or difference is I saved it in a txt file. Now I’m trying to do this part 2 and it’s giving me this error: line 40, in
    tokens = preprocess(tweet[‘text’])
    KeyError: ‘text’
    Should I be replacing the text with a different thing? Or is it because I saved my tweets in a txt file? Hope to get a help from you :)
    It’s really good to know there’s a lot of help like this around.

    1. Hi Joy Tagle – I would like to answer that

      You are on to something here… Because you have to save your data in a .json format and not text.

      If the syntax in the file is correct, you can simply rename the extension (.txt) to .json – that Should do the trick, and afterwards you should be able to “preprocess” it… If not: please don’t hesitate to write here again.

      Best regards,
      Jeppe

    2. Hi Joy,
      the file extension per se doesn’t really matter. The important aspect is the format of the data: each line has to be a proper JSON document, one document per line, no empty lines. Also, make sure you’ve closed the stream so you’re not writing on the file while you attempt to read it. You can also print the tweet before calling preprocess if it helps debugging.
      Cheers,
      Marco

      1. Hey macro what if i want to apply all this on the data saved on database rather than json format.
        kindly reply

  16. Thanks Marco for an amazing document. I can help clear the issues some people have as I have faced them myself. If you have produced .json files from Tweepy or any other twitter streaming api just make sure that your .json file has continuous lists as objects. Simply, get rid of any newline(or whitespace) that occurs between these tweets and you’re good to go with the code. The crux of the problem is understanding what json.dumps does and understanding that the .json file needs to be without whitespace or newlines.
    You could refer to this->
    https://docs.python.org/3/library/json.html
    http://jsonlines.org/

  17. is there any ideas how to clean my data from those “\u0435”
    [(u'#news', 1068), (u'\u043e', 798), (u'\u0438', 623), (u'\u0430', 613), (u'\u0435', 593), (u'\u043d', 519), (u'#News', 464), (u'\u0441', 451), (u'\u0442', 443), (u'\u0432', 430), (u'\u0440', 400), (u'\u3057', 372), (u'\u30fc', 355), (u'\u3044', 344), (u'RT', 302)]

    1. I’m not quite sure what you mean, Kel3vra. But if you want to remove the non-ASCII characters from the tweet text itself, you can use a regex:
      temp_str = tweet['text']
      # regex to remove non-ASCII characters
      temp_str = re.sub(r'[^\x00-\x7F]+', '', temp_str)

  18. Hi Marco – Its really helpful, Just 1 question from my side, Do you think instead of tokenizer

    tweet.split(' ') can be a better option, bec I see its almost working as re expression

    1. Hi Manu, using split() over whitespaces works only for toy examples. As soon as you have some punctuation, you’re going to miss it. You can also look into nltk.tokenize.TweetTokenizer if you don’t need any particular customisation, it works pretty well

      Cheers,
      Marco

  19. Hi Marco,
    Thank you very much for this enormously useful step-by-step guide. I have a question about the preprocess function – is this supposed to convert all tokens other than emoticons to lower case? If that is the case, why are ‘RT’ and ‘#NLP still in upper case? I see the same thing while implementing this code with my file.

    print(preprocess(tweet))
    # ['RT', '@marcobonzanini', ':', 'just', 'an', 'example', '!', ':D', 'http://example.com', '#NLP']

    Thanks!
    Rekha

    1. Hi Rekha, if you see the definition of the preprocess() function, the lowercasing is optional and it defaults to False, e.g. try preprocess(tweet, lowercase=True)
      Cheers,
      Marco

    1. Hi Shwet, the problem is probably given by the json file, please have a look at the comments above (you might have for example empty lines in your file). Another possible source of problem is the new line character used in Windows, so you might need to open your file with:

      with open(fname, 'r', newline='\r\n') as f:

      Cheers,
      Marco

  20. Hi Marco,

    Thank you for the great tutorial. I have a question about using twitter data. When I test preprocess() function with a normal text I type, it works great as you did. However, when I apply it to my twitter data, there is an extra ” u’ ” in front of every words. Is it some mechanism or rule for twitter data or something went wrong?

    1. Hi Alex, the ‘u’ in front of strings stands for Unicode in Python 2. These examples have been tested in Python 3 so there might be some hiccup here and there if you run them on Python 2. Anyway this is one of the differences between Python 2 and Python 3: in Py2, you have the data type str which holds ASCII strings and a separate unicode data type, while in Py3 the data types have been unified so there’s a str that holds unicode text. There’s a lot of documentation online about this topic in case you stumble on some encoding/decoding issue.

      Cheers,
      Marco

  21. Hi Marco,

    Thank you so much for this info.

    I’m currently doing text pre-processing of facebook dataset mostly comments.
    How can I use this code for facebook data?

    Thank you,

    Best cheers,
    Isaac

  22. Hello, thanks for providing this tutorial for us @Marco

    I want to know, is there any limitation by scrapping this?

    I’ve tried multiple run

    #1 terminal instances
    python parse.py -q fashion -d data

    #2 terminal instances
    python parse.py -q online -d data

    the second terminal instances shown “420” instead, is it limitation from Twitter Streams?

  23. Hi.
    Thanks for nice post. It helps me a lot.
    However I have a question.
    Can I know the reason you use word_tokenizer instead of using TweetTokenize?

    1. Hi
      the word_tokenize example is just to showcase that the “general purpose” tokenizer doesn’t work well with Twitter data. To perform tokenization we use the regex approach which is very similar to what the TweetTokenizer does (you can of course go straight to the TweetTokenizer, this is just to explain what happens under the hood)

      Cheers,
      Marco

  24. Hey marco, thanks for the great series of articles, however Ive been facing a error in json.loads line in my code.
    I get ValueError: Extra data: line 1 column 3115 – line 2 column 1 (char 3114 – 301245) on the line tweet = json.loads(line)

  25. with open('mytweets.json', 'r') as f:
          for line in f:
              tweet = json.loads(line)
              tokens = preprocess(tweet['text'])
              do_something_else(tokens)

    *i keep getting an error which says preprocess is not defined! help!

    * and what is supposed to be in the [text] list?

    1. Hi Charles,
      the function preprocess is defined right above the snippet you’re quoting. The function “do_something_else” is just a placeholder for additional steps (if you don’t have any, you can remove it).
      The variable tweet is a dictionary, and tweet[‘text’] is a string with the content of the tweet itself. The structure of the dictionary is described in the paragraph “the anatomy of a tweet”

      Cheers,
      Marco
