
Mining Twitter Data with Python (Part 1: Collecting data)

Twitter is a popular social network where users share short SMS-like messages called tweets. Users share thoughts, links and pictures on Twitter, journalists comment on live events, and companies promote products and engage with customers. The list of ways to use Twitter could go on, and with 500 million tweets per day, there’s a lot of data to analyse and play with.

This is the first in a series of articles dedicated to mining data on Twitter using Python. In this first part, we’ll see different options to collect data from Twitter. Once we have built a data set, in the next episodes we’ll discuss some interesting data applications.

Update July 2016: my new book on data mining for Social Media is out! Part of the content in this tutorial has been improved and expanded as part of the book, so please have a look. Chapter 2 about mining Twitter is available as a free sample from the publisher’s web site, and the companion code with many more examples is available on my GitHub.

Table of Contents of this tutorial:

More updates: fixed version number of Tweepy to avoid problem with Python 3; fixed discussion on _json to get the JSON representation of a tweet; added example of process_or_store().

Register Your App

In order to have access to Twitter data programmatically, we need to create an app that interacts with the Twitter API.

The first step is the registration of your app. In particular, you need to point your browser to the Twitter app registration page, log in to Twitter (if you’re not already logged in) and register a new application. You can then choose a name and a description for your app (for example “Mining Demo” or similar). You will receive a consumer key and a consumer secret: these are application settings that should always be kept private. From the configuration page of your app, you can also request an access token and an access token secret. Like the consumer keys, these strings must be kept private: they give the application access to Twitter on behalf of your account. The default permissions are read-only, which is all we need in our case, but if you decide to change the permissions to provide writing features in your app, you must generate a new access token.
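One way to keep these four strings out of your source code is to read them from environment variables. The following is only a minimal sketch under that assumption: the variable names and the load_credentials() helper are hypothetical, and not part of Twitter or Tweepy.

```python
import os

# Hypothetical variable names: pick whatever fits your environment.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_credentials(env=os.environ):
    """Return the four credentials as a dict, or raise if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values can then be used in place of the hard-coded placeholder strings when setting up the OAuth handler.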

Important Note: there are rate limits in the use of the Twitter API, as well as limitations in case you want to provide a downloadable data-set, see:

Accessing the Data

Twitter provides REST APIs you can use to interact with their service. There are also a number of Python-based clients out there that we can use without re-inventing the wheel. In particular, Tweepy is one of the most interesting and straightforward to use, so let’s install it:

pip install tweepy==3.3.0

Update: release 3.4.0 of Tweepy has introduced a problem with Python 3, currently fixed on GitHub but not yet available with pip. For this reason we’re using version 3.3.0 until a new release is available.

More Updates: the release 3.5.0 of Tweepy, already available via pip, seems to solve the problem with Python 3 mentioned above.

In order to authorise our app to access Twitter on our behalf, we need to use the OAuth interface:

import tweepy
from tweepy import OAuthHandler

consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

The api variable is now our entry point for most of the operations we can perform with Twitter.

For example, we can read our own timeline (i.e. our Twitter homepage) with:

for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(status.text)

Tweepy provides the convenient Cursor interface to iterate through different types of objects. In the example above we’re using 10 to limit the number of tweets we’re reading, but we can of course access more. The status variable is an instance of the Status() class, a nice wrapper to access the data. The JSON response from the Twitter API is available in the attribute _json (with a leading underscore), which is not the raw JSON string, but a dictionary.
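To make the difference between a dictionary and a JSON string concrete, here is a small sketch using only the standard json module. The tweet dictionary is a made-up stand-in for status._json; a real tweet carries many more fields.

```python
import json

# Made-up stand-in for status._json; a real tweet has many more fields.
tweet = {"id": 1, "text": "Hello #python", "user": {"screen_name": "example"}}

# Being a dictionary, it supports direct key access.
print(tweet["user"]["screen_name"])

# json.dumps() serialises the dictionary into a JSON string...
as_string = json.dumps(tweet)

# ...and json.loads() parses such a string back into a dictionary.
restored = json.loads(as_string)
```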

So the code above can be re-written to process/store the JSON:

for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    process_or_store(status._json)

What if we want a list of all the accounts we follow (our “friends”, in Twitter’s terminology)? There you go:

for friend in tweepy.Cursor(api.friends).items():
    process_or_store(friend._json)

And how about a list of all our tweets? Simple:

for tweet in tweepy.Cursor(api.user_timeline).items():
    process_or_store(tweet._json)

In this way we can easily collect tweets (and more) and store them in the original JSON format, fairly easy to convert into different data models depending on our storage (many NoSQL technologies provide some bulk import feature).

The function process_or_store() is a place-holder for your custom implementation. In the simplest form, you could just print out the JSON, one tweet per line:

import json

def process_or_store(tweet):
    print(json.dumps(tweet))
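If you would rather keep the tweets than print them, one possible variant appends each tweet to a file, one JSON document per line. This is only a sketch: the filename parameter and its default value are assumptions made for illustration.

```python
import json


def process_or_store(tweet, filename="tweets.json"):
    """Append one tweet (a dictionary) to `filename` as a single JSON line."""
    with open(filename, "a", encoding="utf-8") as f:
        # json.dumps() escapes embedded newlines, so each tweet stays on one line.
        f.write(json.dumps(tweet) + "\n")
```

Because the extra parameter has a default, calls such as process_or_store(status._json) keep working unchanged.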


In case we want to “keep the connection open”, and gather all the upcoming tweets about a particular event, the streaming API is what we need. We need to extend the StreamListener() to customise the way we process the incoming data. A working example that gathers all the new tweets with the #python hashtag:

from tweepy import Stream
from tweepy.streaming import StreamListener

class MyListener(StreamListener):

    def on_data(self, data):
        try:
            # Append the raw JSON of each incoming tweet to the file
            with open('python.json', 'a') as f:
                f.write(data)
                return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        print(status)
        return True

twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['#python'])

Depending on the search term, we can gather tons of tweets within a few minutes. This is especially true for live events with a world-wide coverage (World Cups, Super Bowls, Academy Awards, you name it), so keep an eye on the JSON file to understand how fast it grows and consider how many tweets you might need for your tests. The above script will save each tweet on a new line, so you can use the command wc -l python.json from a Unix shell to know how many tweets you’ve gathered.
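The same file can also be read back from Python. The sketch below assumes the one-tweet-per-line layout described above; load_tweets() is a hypothetical helper name, and it skips blank lines in case any end up in the file.

```python
import json


def load_tweets(filename):
    """Yield one dictionary per non-empty line of a one-tweet-per-line file."""
    with open(filename, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, sum(1 for _ in load_tweets('python.json')) counts the tweets collected so far without loading them all into memory.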

You can see a minimal working example of the Twitter Stream API in the following Gist:


We have introduced Tweepy as a tool to access Twitter data in a fairly easy way with Python. There are different types of data we can collect, with the obvious focus on the “tweet” object.

Once we have collected some data, the possibilities in terms of analytics applications are endless. In the next episodes, we’ll discuss some options.



134 thoughts on “Mining Twitter Data with Python (Part 1: Collecting data)”

  1. Dear Marco.
    How do I get tweets from other users that are not from my timeline, that is, in general? Is it possible? How? What is the limit of tweets I can get in Python at a time?


    1. Hi, I haven’t worked much on MySQL recently, but recent versions support a JSON data type so you just create a column of type JSON and dump the entire tweet in it. Another option is that you read the JSON and normalise the structure with only the fields that you need.


  2. please help !

    i get this error :

    File “C:/Users/Hamza-HP/Desktop/untitled/”, line 24, in
    File “C:/Users/Hamza-HP/Desktop/untitled/”, line 18, in process_or_store
    NameError: name ‘_json’ is not defined

    but i define json as follow :

    import simplejson as json

    i don’t really know what’s wrong :(


    1. Hi
      if you import simplejson as json, you’ll need json.dumps(tweet) rather than _json (which is an attribute of status, but not defined in the main namespace).


    1. Hi,
      in the MyListener.on_data() method you can print() the data instead of (or in addition to) writing to file.
      For hdfs you’ll need an hdfs client but I don’t have any particular recommendation here.



  3. Hey Marco, I’m relatively new to coding and I was trying out your script to see 10 of my twitter feeds but its giving me a UnicodeEncodeError. What else do I need to add to the script that you provided to make this work?


    1. Hi Nav,
      it depends on which line is throwing the error and what the exact error is. The examples are tested in Python 3.4+, so if you’re using Python 2 please keep in mind that the string data type is different (unicode in Python 3, non-unicode in Python 2). If that’s the case, please consider upgrading to Python 3. Also have a look at this one for more details:


  4. Hi,
    I tried to the Streaming example given above. But unfortunately it gives a syntax error indicating the ‘&’ symbol before ‘quote’, in “print("Error on_data: %s" % str(e))” line.
    How can I get this error fixed ?

    Thanks in advance !


    1. hi, i got same error.
      i just put commend on exception like this :
      #print("Error on_data: %s" % str(e))
      and put this code on the next line or replace it both working wonderfull :
      print (‘erorr’, str(e))


      1. Hi,
        WordPress keeps messing around with quotes and HTML symbols, I think this is fixed now, until it breaks the next time. The symbol should be a regular double quote like this: "


  5. Hi – can you recommend a hosting platform for a python listener such as this ? e.g. a host that will offer everything required and not charge the earth for an “always on” app ? Thanks !


    1. Hi Justin, in terms of hosting I’ve only used aws for this, not sure if it fits your requirements. I’m sure there are other options but I don’t have a specific recommendation at the moment I’m afraid.


  6. Hello, these days I have saved several tweets with your script.
    Today I was watching them and I seem all written by professionals and not by individual.
    They are tweets that talk about the weather, news, promotion activities etc …
    it’s normal? or am I doing something wrong?
    For me it would be more useful to analyze the common people tweet, in order to have a vision of what people really think.
    Here I have published some are in Italian, but your last name seems Italian: D


  7. print("Error on_data: %s" % str(e))
    SyntaxError: invalid syntax
    I keep getting this error running in Andaconda. Could get it to run the 10 post but cant get it to run live stream.


    1. Hi
      if you use the code as it is, it creates a “python.json” file in the same folder where you’re running the script. If you check out the examples from my book you see how the filename is created dynamically.



    1. Hi
      yes you can come up with the file name dynamically, using a different name every time, for example using the tweet id to ensure the names are unique. Maybe adding the timestamp as well so you can sort them “easily” (you’ll end up with too many files)



  8. hey can i filter the tweets about political views in a certain country like for Zambia only..reply asap…


      1. Hi Marco, I have read that one can only stream based on a keyword or location but not both. Does that sound right to you? thanks


  9. Hi
    Actually I have to do sentiment analysis and for that purpose I need to collect some Twitter data so can you please tell how to we get consumer I’d and consumer secret mentioned in your tutorials first part.
    How the application will be set up ?


  10. HI Marco, I really enjoy reading your blogs and attending meetups you speak at. I am trying to collect a year’s worth twitter data from the current date based on selected keywords . I’ve used the twitter search API but it only seems to give me 12 days worth results (around 3000 tweets) for a keyword. Would I be able to get results from a longer period using tweepy (ideally I would like to specify the start and end date for my search) ? or would I need to subscribe to Twitter Firehouse ?

    Best Wishes

