Mining Twitter Data with Python (Part 1: Collecting data)

Twitter is a popular social network where users share short SMS-like messages called tweets. Users share thoughts, links and pictures, journalists comment on live events, and companies promote products and engage with customers. The list of ways to use Twitter could be really long, and with 500 million tweets per day, there’s a lot of data to analyse and to play with.

This is the first in a series of articles dedicated to mining data on Twitter using Python. In this first part, we’ll see different options to collect data from Twitter. Once we have built a data set, in the next episodes we’ll discuss some interesting data applications.

Update July 2016: my new book on data mining for Social Media is out! Part of the content in this tutorial has been improved and expanded as part of the book, so please have a look. Chapter 2 about mining Twitter is available as a free sample from the publisher’s web site, and the companion code with many more examples is available on my GitHub.

Table of Contents of this tutorial:

More updates: fixed version number of Tweepy to avoid problem with Python 3; fixed discussion on _json to get the JSON representation of a tweet; added example of process_or_store().

Register Your App

In order to have access to Twitter data programmatically, we need to create an app that interacts with the Twitter API.

The first step is the registration of your app. In particular, you need to point your browser to http://apps.twitter.com, log in to Twitter (if you’re not already logged in) and register a new application. You can now choose a name and a description for your app (for example “Mining Demo” or similar). You will receive a consumer key and a consumer secret: these are application settings that should always be kept private. From the configuration page of your app, you can also request an access token and an access token secret. Like the consumer keys, these strings must be kept private: they give the application access to Twitter on behalf of your account. The default permissions are read-only, which is all we need in our case, but if you decide to change your permissions to provide writing features in your app, you must negotiate a new access token.
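Since these four strings must stay private, one option is to keep them out of the source code and read them from environment variables. The sketch below is only an illustration: the function name load_credentials() is hypothetical, and the variable names are an assumption (any naming convention will do).

```python
import os

# Hypothetical variable names: pick whatever convention suits your setup
REQUIRED_VARS = (
    'TWITTER_CONSUMER_KEY',
    'TWITTER_CONSUMER_SECRET',
    'TWITTER_ACCESS_TOKEN',
    'TWITTER_ACCESS_SECRET',
)

def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError('Missing environment variables: ' + ', '.join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values can then be used in place of hard-coded strings wherever the keys are needed, and the script fails early with a readable message if one of them has not been set.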

Important Note: there are rate limits in the use of the Twitter API, as well as limitations in case you want to provide a downloadable data-set, see:

Accessing the Data

Twitter provides REST APIs you can use to interact with their service. There is also a bunch of Python-based clients out there that we can use without re-inventing the wheel. In particular, Tweepy is one of the most interesting and straightforward to use, so let’s install it:

pip install tweepy==3.3.0

Update: release 3.4.0 of Tweepy introduced a problem with Python 3. It is currently fixed on GitHub but not yet available via pip, so we’re using version 3.3.0 until a new release is available.

More Updates: the release 3.5.0 of Tweepy, already available via pip, seems to solve the problem with Python 3 mentioned above.

In order to authorise our app to access Twitter on our behalf, we need to use the OAuth interface:

import tweepy
from tweepy import OAuthHandler

consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

The api variable is now our entry point for most of the operations we can perform with Twitter.

For example, we can read our own timeline (i.e. our Twitter homepage) with:

for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(status.text) 

Tweepy provides the convenient Cursor interface to iterate through different types of objects. In the example above we’re using 10 to limit the number of tweets we’re reading, but we can of course access more. The status variable is an instance of the Status() class, a nice wrapper to access the data. The JSON response from the Twitter API is available in the attribute _json (with a leading underscore), which is not the raw JSON string, but a dictionary.

So the code above can be re-written to process/store the JSON:

for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    process_or_store(status._json) 
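To make the dictionary-versus-string distinction concrete, here is a small sketch that uses a hand-made dictionary as a stand-in for status._json. The values are made up and no call to Twitter is involved.

```python
import json

# Made-up stand-in for status._json: a plain dict, not a string
tweet = {'id_str': '1', 'text': 'Hello #python', 'user': {'screen_name': 'example'}}

# Fields are accessed like in any dictionary, including nested ones
print(tweet['user']['screen_name'])

# json.dumps() turns the dict into a JSON string, e.g. for storage
as_string = json.dumps(tweet)
print(type(as_string).__name__)

# json.loads() goes the other way and gives back an equal dict
assert json.loads(as_string) == tweet
```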

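What if we want a list of all the accounts we follow (our “friends” in Twitter’s terminology)? There you go (for the accounts that follow us, api.followers can be used in the same way):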

for friend in tweepy.Cursor(api.friends).items():
    process_or_store(friend._json)

And how about a list of all our tweets? Simple:

for tweet in tweepy.Cursor(api.user_timeline).items():
    process_or_store(tweet._json)

In this way we can easily collect tweets (and more) and store them in the original JSON format, which is fairly easy to convert into different data models depending on our storage (many NoSQL technologies provide some bulk import feature).
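As an illustration of converting the JSON into a different data model, a nested tweet dictionary can be flattened into a record and written out as CSV. This is a sketch under assumptions: flatten_tweet() and tweets_to_csv() are hypothetical helper names, the selection of fields is arbitrary, and the sample tweet is made up.

```python
import csv
import io

FIELDS = ['id_str', 'created_at', 'screen_name', 'text', 'retweet_count']

def flatten_tweet(tweet):
    """Pick a few fields from a tweet dict, pulling screen_name out of 'user'."""
    user = tweet.get('user') or {}
    return {
        'id_str': tweet.get('id_str'),
        'created_at': tweet.get('created_at'),
        'screen_name': user.get('screen_name'),
        'text': tweet.get('text'),
        'retweet_count': tweet.get('retweet_count', 0),
    }

def tweets_to_csv(tweets, out):
    """Write an iterable of tweet dicts to the file-like object 'out' as CSV."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(flatten_tweet(tweet))

# Made-up sample data
sample = [{'id_str': '1', 'created_at': 'Mon Mar 02 10:00:00 +0000 2015',
           'text': 'Hello #python', 'retweet_count': 3,
           'user': {'screen_name': 'example'}}]
buffer = io.StringIO()
tweets_to_csv(sample, buffer)
print(buffer.getvalue())
```

Anything not listed in FIELDS is dropped, so the flat version is a lossy view of the data; keeping the original JSON around is usually a good idea.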

The function process_or_store() is a place-holder for your custom implementation. In the simplest form, you could just print out the JSON, one tweet per line:

import json

def process_or_store(tweet):
    # tweet is a dictionary: json.dumps() serialises it into a string
    print(json.dumps(tweet))
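A slightly more useful variant appends each tweet to a file, one JSON document per line. This is just a sketch: the default file name and the extra path parameter are assumptions, not something the rest of the article relies on.

```python
import json

def process_or_store(tweet, path='my_tweets.jsonl'):
    """Append one tweet (a dict) to 'path' as a single line of JSON."""
    with open(path, 'a') as f:
        f.write(json.dumps(tweet) + '\n')
```

Opening the file for every tweet is simple rather than efficient, which should be fine for a few thousand tweets; for larger collections you may prefer to open the file once and write inside the loop.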

Streaming

In case we want to “keep the connection open” and gather all the new tweets about a particular event as they are published, the streaming API is what we need. We need to extend StreamListener to customise the way we process the incoming data. A working example that gathers all the new tweets with the #python hashtag:

from tweepy import Stream
from tweepy.streaming import StreamListener

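class MyListener(StreamListener):
    """Append every message received from the stream to python.json."""

    def on_data(self, data):
        try:
            # data is the raw JSON string received from Twitter
            with open('python.json', 'a') as f:
                f.write(data)
        except Exception as e:
            # Catch Exception rather than BaseException, so that Ctrl+C
            # (KeyboardInterrupt) can still stop the script
            print("Error on_data: %s" % str(e))
        # Returning True keeps the connection open
        return True

    def on_error(self, status):
        # status is the HTTP status code returned by Twitter
        print(status)
        return True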

twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['#python'])

Depending on the search term, we can gather tons of tweets within a few minutes. This is especially true for live events with a world-wide coverage (World Cups, Super Bowls, Academy Awards, you name it), so keep an eye on the JSON file to understand how fast it grows and consider how many tweets you might need for your tests. The above script will save each tweet on a new line, so you can use the command wc -l python.json from a Unix shell to know how many tweets you’ve gathered.
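The same count can be obtained from Python, which also checks that every line is valid JSON. A minimal sketch, where count_tweets() is a hypothetical helper name and blank lines are skipped as a precaution:

```python
import json

def count_tweets(path):
    """Count the lines in 'path' that contain a valid JSON document."""
    count = 0
    with open(path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip empty lines, if any
            json.loads(line)  # raises ValueError on a malformed line
            count += 1
    return count
```

For the file produced by the listener above, the call would be count_tweets('python.json').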

You can see a minimal working example of the Twitter Stream API in the following Gist:

twitter_stream_downloader.py

Summary

We have introduced tweepy as a tool to access Twitter data in a fairly easy way with Python. There are different types of data we can collect, with the obvious focus on the “tweet” object.

Once we have collected some data, the possibilities in terms of analytics applications are endless. In the next episodes, we’ll discuss some options.

@MarcoBonzanini

Published by

Marco

Data Scientist

91 thoughts on “Mining Twitter Data with Python (Part 1: Collecting data)”

  1. Very nice post, as usual. I am actually using part of this code to build a small application to download all the images of a Twitter account. I have only one small comment: for the sake of completeness, I believe you should also import json in your example.

    1. Hi Bozhidar,
      the function is not defined in the code sample (hence the error), it’s just a placeholder for you to personalise depending on your needs (storing the data, do some pre-processing as described in Part 2, etc.). In the simplest form, you can just substitute it with a print() and dump all the JSON in a plain file

      1. I tried to solve it this way and i get this error: AttributeError: ‘Status’ object has no attribute ‘json’
        can you help me!

  2. Am working on a project using Python & Twitter and I chanced upon your site! Really cool & very useful! My interests lie in data science too, so I’ll be back :)

  3. Hi Marco,

    Really great intro. Don’t want to take up your time debugging, but was wondering if you know how to add a timer to the data collection process. I tried the following….

    if (time.time() - start_time)/60 <= 1:
        twitter_stream = Stream(auth, MyListener())
        twitter_stream.filter(track=['#python'])
    else:
        print("—collection terminated at %s seconds —") %round((time.time() -start_time),2)
        import sys
        sys.exit()

    It runs, but doesn't stop after 1 minute like intended. Any ideas?

    1. hi basilspike,

      with the if/else described in that way, the else branch is never executed (so the script doesn’t stop).

      One option is to check the timer in the on_data() method, for example before saving/printing the tweets you could do:

      if (time.time() - self.start_time)/60 >= 1:
          return False

      You need to set self.start_time in the __init__ method. Please notice that in this case the script won’t stop after one minute, but only when you receive the first tweet that arrives after the one-minute timeout (could be a big difference for low volume streams).

      Another option would be to use a subprocess for the download, and terminate it after the timeout is gone.

  4. Hey !

    This is the error I’m encountering on running the StreamListener code:

    C:\Python27\lib\site-packages\requests-2.7.0-py2.7.egg\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
    InsecurePlatformWarning
    C:\Python27\lib\site-packages\requests-2.7.0-py2.7.egg\requests\packages\urllib3\connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
    SecurityWarning

    What does this error signify? And how can I debug this error?

    1. Hi Manas,
      probably you’ll need to upgrade to a newer Python version, please have a look at the two links in the warning message you posted (the first one in particular)

      1. Hey Marco, thanks a lot for this tutorial. How do I delete the lines from the json file after “n” minutes and keep streaming. I want to do this in time windows. For example:

        1. Start the stream.
        2. After 10 minutes, delete the lines from the json file.
        3. Start writing again the tweets.
        4. After 10 minutes, delete the lines again.
        5. Start writing again the tweets.
        6. Repeat 2 and 3 in kind of loop (I don’t know how to call it)

        I want to do this because I want to clean my json file after an amount of time.

        Ps: Sorry for my english. I’m just learning.

  5. Hi Catalin

    Change the following lines in streaming.py
    l161: self._buffer += self._stream.read(read_len).decode('ascii')
    l171: self._buffer += self._stream.read(self._chunk_size).decode('ascii')

    This fixes it for python3.4, I do not know about python2.7

    1. Hi Catalin, hi che and thanks for your input.

      To expand on this, the problem is introduced with version 3.4.0 of tweepy for Python 3. Version 3.3.0 of tweepy, the one used when this article was written, is immune from this problem on Python 3.

      The issue is open on github: https://github.com/tweepy/tweepy/issues/615
      The workaround suggested by che works for Python 3, but if you don’t want to tamper with the libraries you could also simply downgrade tweepy, e.g.
      pip install -I tweepy==3.3.0

  6. Thanks Marco for your twitter answer. However, I am new in Python, and I would need a bit more of information. For now I am trying different packages, but I have difficulties making them work. I want to ask you what I need to run code such as Tweepy. I have Anaconda, Pycharm, text wrangler but I cannot find the way to run the code. I have installed tweetpy through the terminal, and when I try to use it in terminal, pycharm or anaconda, it gives different errors. Probably because python is not well set up, I am not using IDE properly, or any other beginner’ reason

    1. Hi Javier, thanks for following up here.
      So from the command line, firstly check that you can invoke the python interpreter correctly, e.g. typing “python --version” (it should give you 3.4.3 if you have the latest anaconda for python 3).
      Anaconda doesn’t come with a package for tweepy, so if you try “conda install tweepy” you’ll see an error. You can anyway install it with “pip install tweepy==3.3.0”.
      I’ve updated the post with a small example of how to use the stream API posted on github: https://gist.github.com/bonzanini/af0463b927433c73784d
      so you can just save the file and run it (assuming you have the right credentials/tokens as explained in the article)

    1. Hi, 401 is the status code for “unauthorized”. Usually this problem is related to either missing or incorrect credentials (the OAuth part at the beginning of the article), so that’s where to look first.

      1. Hey Marco, thank you for your reply

        The problem is that when I executed the previous commands such as user_timeline, friends, and home_timeline, they worked well and gave me the right output using the same credentials. When I use them in the streaming part, it gives me 401?

  7. Hi Marco – I’m rather new at working with API’s, and your rundown of how to access Twitter data is of great help. Thanks!

    I have one question though. It seems to me that the data is already available in JSON. By making use of the ._json key that is directly available through Tweepy, you don’t have to define and apply the parse method:

    from pprint import pprint
    for status in tweepy.Cursor(api.user_timeline).items():
        pprint(status._json)

    Correct me if I’m wrong – I’m by no means an expert in this field.

    1. Hi Nicolai, that’s correct and it’s something I have to fix in the article, it’s been in the pipeline for a while. It’s worth mentioning that the _json attribute is a dict, not the raw JSON string. Thanks for your input.

      Cheers
      Marco

  8. Hi Marco,

    Thanks for the great article. But I see that the place field has null values. I also want to fetch the country or location from where the text is being tweeted. Please help.

    1. Hi Jigar, places and coordinates are given optionally by the users and are often omitted. Unfortunately, only a small share of tweets has these data set explicitly. Probably you’ll need to collect more data to see something interesting.
      Cheers,
      Marco
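To check how many tweets in a collected file actually carry this information, something along these lines could be used. It is only a sketch: geo_share() is a hypothetical helper name, and it assumes a file with one JSON tweet per line, like the one written by the stream listener in the article.

```python
import json

def geo_share(path):
    """Return (tweets_with_geo, total_tweets) for a file with one tweet per line."""
    with_geo = 0
    total = 0
    with open(path, 'r') as f:
        for line in f:
            if not line.strip():
                continue  # skip empty lines, if any
            tweet = json.loads(line)
            total += 1
            # Both fields are null (None in Python) when not shared by the user
            if tweet.get('place') or tweet.get('coordinates'):
                with_geo += 1
    return with_geo, total
```

For example, geo_share('python.json') returns the two counts, from which the share of geo-tagged tweets follows directly.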

  9. Hi Marco
    When I wrote the first command line “import tweepy” from python, I got the error
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    ImportError: No module named 'tweepy'

    I upgraded to python3 and it still gives same error, any clue why this happens?
    Thanks
    Deena

    1. This is what I got when I was downloading tweepy thru sudo pip install tweepy
      Downloading/unpacking tweepy
      Downloading tweepy-3.4.0.tar.gz
      Running setup.py egg_info for package tweepy
      Traceback (most recent call last):
      File “”, line 14, in
      File “/Users/deena/build/tweepy/setup.py”, line 17, in
      install_reqs = parse_requirements(‘requirements.txt’, session=uuid.uuid1())
      TypeError: parse_requirements() got an unexpected keyword argument ‘session’
      Complete output from command python setup.py egg_info:
      Traceback (most recent call last):

      File “”, line 14, in

      File “/Users/deena/build/tweepy/setup.py”, line 17, in

      install_reqs = parse_requirements(‘requirements.txt’, session=uuid.uuid1())

      TypeError: parse_requirements() got an unexpected keyword argument ‘session’

  10. Hi Marco
    I tried to write the command import tweepy from python 2.7 but it gave me the error
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    ImportError: No module named 'tweepy'
    I upgraded to python3, still same error. Any clue why this happens?
    Thanks
    Deena

    1. Hi deena, the examples are tested on Python 3. You should be able to install the correct version on Tweepy with:
      pip install tweepy==3.3.0
      (I’ve updated the article with an explanation on the version numbers)
      In general it’s more sensible to use a virtualenv rather than sudo

  11. hello everyone, i have to go through the process of collecting tweets for a certain period (1/2015-9/2015) for my thesis, reading the article i started to register and create a new app but i have no website and the field is required. any suggestions?

    thank you in advance

    1. Hi Christina, the website is required but you can put a placeholder (as suggested also by Twitter). For example your twitter handle, or your github page, will do

  12. Marco another question,

    i tried to install tweepy. i got stuck in cmd when i type python setup.py install i get this: AttributeError: ‘str’ object has no attribute ‘req’.

    1. Hi, I’d recommend to use pip with a virtualenv to install the libraries, rather than using the bleeding-edge version from github — and keep in mind that the code in these articles has been tested on Python 3.4

      Cheers, Marco

      1. thank you very much Marco, your answers are very helpful. Another maybe silly question: can i run the code with python IDLE? I’m at the point where i start to access the data with the OAuth interface

  13. hello everyone! i would like to ask why, when retrieving tweets, i get some repeated. it’s like the api returns a tweet more than once. how can i fix that?

    thank you in advance

    1. Hi, on_data() is the entry point for any sort of data received, on_status() is specific for receiving statuses. You can implement directly the on_status method if you prefer.

      Cheers,
      Marco

  14. How would you store all your tweets in a separate json file instead of just printing them all out like you show in the process_or_store function? I see in later posts you reference the most common words in your tweets and I wanted to know how to do that.

    1. Hi Aidan, similarly to the on_data() method in the streaming example, you can open the file and then just dump the json in it, something similar to:

      with open('my_tweets.jsonl', 'w') as f:
          for tweet in tweepy.Cursor(api.user_timeline).items():
              f.write(json.dumps(tweet._json)+"\n")
      

      (even without defining a custom process_or_store() here)
      Cheers,
      Marco

      1. Hey thanks for the help!
        When I run your suggested code I’m getting an error:
        TypeError: unsupported operand type(s) for +: ‘dict’ and ‘str’

        Something to do with the +”\n” part I think, but I can’t figure out how to make it work and an internet search hasn’t helped much.
        Any further advice you can give me would be hugely appreciated

    2. I’ve updated the example: tweet._json is in fact a dictionary (loaded from the original json string, but still not a string), so you need json.dumps() first
      Cheers,
      Marco

      1. Thank you for the help.
        If I were trying to do the same thing but get all the tweets I have favorited in a json file, would the code structure be the same?
        I tried with api.favorites instead of api.user_timeline, but I get an error code 429 which means too many requests.
        I guess what I’m asking is is there a way to get around the rate limit in this case and what would you suggest?

        Thanks again,
        Aidan

  15. Marco,

    Great page. Very helpful. I am a beginner who is attempting to use your guide to locate the geographical origin of certain hashtags.

    For the process_or_store() function, how do I tweak its definition so that it saves the stream of tweets I collect as a json file (such as the ‘mytweets’ one you refer to in the second part of this guide).

    Thanks in advance.

    1. Hi Michael, thanks for the comment.

      when downloading your tweets, you can for example do something like:

      with open('my_tweets.jsonl', 'w') as f:
          for tweet in tweepy.Cursor(api.user_timeline).items():
              f.write(json.dumps(tweet._json)+"\n")

      (here you don’t even need a custom process_or_store() )

      For the streaming part, for example to download tweets with a given hashtag, the MyListener class defined in the article already stores the tweets on a file.

      Also, please have a look at this: https://marcobonzanini.com/2016/08/02/mastering-social-media-mining-with-python/ (chapter 2, about mining Twitter, is available as a free sample from the publisher’s web site, and more sample code is available on my Github).

      Cheers,
      Marco

  16. Marco,

    Is there an easy way to put a time limit on the stream of data?

    I was thinking of something like
    import time
    start_time =time.time()
    end_time = start_time + 10

    if end_time < time.time():
        sys.exit()

    So my code looks like this:

    import time

    ..

    start_time =time.time()
    end_time = start_time + 10

    from tweepy import Stream
    from tweepy.streaming import StreamListener

    class MyListener(StreamListener):

        def on_data(self, data):
            start = time.time()
            try:
                with open('fresh2.json', 'a') as f:
                    f.write(data)
                    return True
            except BaseException as e:
                print('Error on_data', str(e))
                return True

            finally:
                if end_time < time.time():
                    sys.exit()

    But this does not work? Thanks in advance. Excellent page.

    1. Hi Peter,
      off the top of my head, the simplest option to achieve this programmatically is to set the desired end time in the constructor of the custom listener, and then check the time whenever you receive data, e.g.

      class MyListener(StreamListener):
          def __init__(self):
              # keep the initialisation done by the parent class
              super().__init__()
              self.end_time = time.time() + 10

          def on_data(self, data):
              if time.time() > self.end_time:
                  sys.exit()
              # plus the rest of the class definition
      

      (I haven’t tested this code, might need some tweaking)
      one problem with this approach is that the check is triggered only if you receive data, so for a very low-volume stream you might keep the connection going for more than 10 seconds.

      Another option is to use some OS facility, e.g. Linux has a timeout command (also available on mac via “brew install coreutils”, in this case called gtimeout):

      $ gtimeout 10s python my_file.py

      this will run the python script and kill it after 10 seconds.

      Cheers,
      Marco

  17. Hi Marco

    I bought your “Mastering Social Media Mining with Python” book, and I’m trying to get the accompanying code on Github to work. I’m at twitter_streaming, which I haven’t been able to get working.

    Following the instructions in the book, I did the following
    – created a virtual environment
    – set the four parameters (consumer key and secret, access token and secret) as environment variables
    – ran the twitter_client.py script (for authentication) from windows cmd prompt as follows
    $ python twitter_client.py
    – ran the twitter_streaming.py script from windows cmd prompt with the keywords as follows
    $ python twitter_streaming.py keyword1 keyword2 keyword 3

    I’m getting 401 errors on running twitter_streaming.py. I tried to abort/exit the execution using Ctrl+C but no success – I’m unable to abort/exit the execution, and i’m getting a continuous stream of 401 errors.

    My questions
    – how to make sure that twitter_client.py was successfully executed?
    – why do you think I’m getting 401 errors on executing twitter_streaming.py? and how can I abort/exit the execution to get to the prompt?
    – is there a way to create the virtual environment and set the virtual environment variables and execute the accompanying scripts from inside the python interpreter rather than from the cmd prompt?

    Note:
    – I’ve been using python for about a year now, but still consider myself a beginner.

    Thank you in advance
    Ahmed

  18. Hi Ahmed

    error 401 is usually given because the credentials are incorrect, so possibly a copy-paste mismatch? If the variables are not set at all for the current session, the script would raise an error and quit.

    You don’t need to call twitter_client.py explicitly, because it’s used by the other scripts to set up the authentication. If you set the environment variables from the command line, keep in mind that these variables are scoped to your current session, i.e. when you close the console window they are not kept. I usually put all the environment setting commands in a single shell script that I run once per session (I also make sure that it’s ignored by the source control and not checked in). You can check the value of these variables using “echo %VAR_NAME%”, for example echo %TWITTER_CONSUMER_KEY%

    If you’re not comfortable with defining the environment in this way, you can still hard-code the values, e.g. in twitter_client.py you can replace the get_twitter_auth() function:

    def get_twitter_auth():
        consumer_key = 'YOUR-CONSUMER-KEY-HERE'
        consumer_secret = 'YOUR-CONSUMER-SECRET-HERE'
        access_token = 'YOUR-ACCESS-TOKEN-HERE'
        access_secret = 'YOUR-ACCESS-SECRET-HERE'
        auth = OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_secret)
        return auth

    This is not a great design as explained in Chapter 2 (separation of concerns between app logic and config), but it’s good enough to get you started, just make sure that you don’t accidentally push your personal keys to github :)
    Cheers,
    Marco

    1. Hi
      JSON can represent a nested structure, while CSV only a flat record-like one, so in the general case you can’t directly map JSON to CSV. You first need to find a way to normalise the JSON data. Once you ensure your data structure is flat, you can use the Python csv module (csv.writer in particular) to produce the CSV output.
      Cheers,
      Marco

  19. I am trying to gather all the tweets about a particular event through the streaming API. I am able to get data for #apple but when I am trying to stream data for #AirbnbWhileBlack, I am getting 1 tweet/30 min.

    I have tried registering a new app on the Twitter website, but I’m still facing the same issue. I have searched on Google but couldn’t find any solution to the problem. Does anyone have any idea how I can resolve this problem? Or any website through which I can collect data on #AirbnbWhileBlack?

    1. Hi nayansinghal,
      from what I can see, that hashtag has a very low frequency at the moment, i.e. only a few tweets per day, so what you’re observing is in fact correct. You can look into the Twitter Search API (also supported by tweepy) rather than the streaming used here, so you can go back approximately one week (but some tweets/users might be missing from the results, as explained in the documentation):
      https://dev.twitter.com/rest/public/search
      http://docs.tweepy.org/en/v3.5.0/api.html#API.search

      Cheers,
      Marco

      1. Thanks Marco for your response. I checked the Twitter search API and got only 70 tweets overall. I searched on Google and figured out that these APIs will provide you tweet only for past 7 days. Is there any other method through which I can get tweets before 1 week?

        I know one of the solution is to extract tweet using web page scrapper but that doesn’t seem me a good idea.

    1. Hi,
      you need to check the correct indentation of the code as shown in the article (sometimes with copy&paste the indentation gets lost)
      Cheers,
      Marco

  20. I’m interested in getting tweets of somebody’s followers, but I only get up to 20 followers, even though this person has more, why does this happen?

    1. Hi, I suggest you do some post-processing after the streaming is closed.
      For example:

      import json
      
      MIN_RT = 5
      
      with open('tweets.json', 'r') as fin, open('tweets_filtered.json', 'w') as fout:
          for line in fin:
              tweet = json.loads(line)
              if tweet['retweet_count'] >= MIN_RT:
                  fout.write(line)

      This will read your tweets.json file and create a tweets_filtered.json with only tweets that have been retweeted at least 5 times
      Cheers,
      Marco
