Mining Twitter Data with Python (Part 1: Collecting data)

Twitter is a popular social network where users can share short SMS-like messages called tweets. Users share thoughts, links and pictures on Twitter, journalists comment on live events, companies promote products and engage with customers. The list of ways to use Twitter could go on and on, and with around 500 million tweets per day, there's a lot of data to analyse and to play with.

This is the first in a series of articles dedicated to mining data on Twitter using Python. In this first part, we’ll see different options to collect data from Twitter. Once we have built a data set, in the next episodes we’ll discuss some interesting data applications.

Update July 2016: my new book on data mining for social media is out! Part of the content in this tutorial has been improved and expanded as part of the book, so please have a look. Chapter 2, about mining Twitter, is available as a free sample from the publisher's web site, and the companion code with many more examples is available on my GitHub.

More updates: fixed the Tweepy version number to avoid a problem with Python 3; fixed the discussion of _json to get the JSON representation of a tweet; added an example of process_or_store().

Register Your App

In order to have access to Twitter data programmatically, we need to create an app that interacts with the Twitter API.

The first step is the registration of your app. In particular, you need to point your browser to http://apps.twitter.com, log in to Twitter (if you're not already logged in) and register a new application. You can now choose a name and a description for your app (for example "Mining Demo" or similar). You will receive a consumer key and a consumer secret: these are application settings that should always be kept private. From the configuration page of your app, you can also request an access token and an access token secret. Similarly to the consumer keys, these strings must also be kept private: they provide the application access to Twitter on behalf of your account. The default permissions are read-only, which is all we need in our case, but if you decide to change your permissions to provide writing features in your app, you must negotiate a new access token.

Important Note: there are rate limits on the use of the Twitter API, as well as limitations in case you want to provide a downloadable data set; the details are in the official Twitter API documentation.

Accessing the Data

Twitter provides REST APIs you can use to interact with their service. There are also a number of Python-based clients out there that we can use without re-inventing the wheel. In particular, Tweepy is one of the most interesting and straightforward to use, so let's install it:

pip install tweepy==3.3.0

Update: release 3.4.0 of Tweepy introduced a problem with Python 3, currently fixed on GitHub but not yet available via pip, so we're using version 3.3.0 until a new release is available.

More updates: release 3.5.0 of Tweepy, already available via pip, seems to solve the Python 3 problem mentioned above.

In order to authorise our app to access Twitter on our behalf, we need to use the OAuth interface:

import tweepy
from tweepy import OAuthHandler

consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

The api variable is now our entry point for most of the operations we can perform with Twitter.
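
As a side note on the rate limits mentioned earlier, Tweepy can also wait and retry on your behalf when a limit is hit. A minimal sketch of this variant (the wait_on_rate_limit options are available in the Tweepy 3.x releases):

api = tweepy.API(auth,
                 wait_on_rate_limit=True,          # sleep when a rate limit is reached
                 wait_on_rate_limit_notify=True)   # print a notification while waiting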

For example, we can read our own timeline (i.e. our Twitter homepage) with:

for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(status.text)

Tweepy provides the convenient Cursor interface to iterate through different types of objects. In the example above we’re using 10 to limit the number of tweets we’re reading, but we can of course access more. The status variable is an instance of the Status() class, a nice wrapper to access the data. The JSON response from the Twitter API is available in the attribute _json (with a leading underscore), which is not the raw JSON string, but a dictionary.

So the code above can be re-written to process/store the JSON:

for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    process_or_store(status._json)

What if we want a list of all the users we follow? These are called friends in the Twitter jargon (for the users who follow us, there's the analogous api.followers). There you go:

for friend in tweepy.Cursor(api.friends).items():
    process_or_store(friend._json)

And how about a list of all our tweets? Simple:

for tweet in tweepy.Cursor(api.user_timeline).items():
    process_or_store(tweet._json)

In this way we can easily collect tweets (and more) and store them in the original JSON format, which is fairly easy to convert into different data models depending on our storage (many NoSQL technologies provide some bulk import feature).
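
For example, assuming you save one JSON document per line (as we'll do in a moment), a bulk insert into MongoDB could look like the following sketch. It assumes pymongo is installed and a MongoDB instance is running locally; the database, collection and file names are made up for the example:

import json
from pymongo import MongoClient

client = MongoClient()  # connects to localhost:27017 by default
collection = client.twitter_db.tweets  # hypothetical database/collection names

with open('my_tweets.jsonl') as f:
    # one JSON document per line, loaded and inserted in a single batch
    documents = [json.loads(line) for line in f]
    collection.insert_many(documents)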

The function process_or_store() is a placeholder for your custom implementation. In its simplest form, you could just print out the JSON, one tweet per line:

import json

def process_or_store(tweet):
    print(json.dumps(tweet))
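
If you'd rather persist the tweets, a slightly more useful variant appends each tweet to a file, one JSON document per line (the file name here is just an example):

import json

def process_or_store(tweet):
    # append each tweet as a single line of JSON (the JSON Lines format)
    with open('my_tweets.jsonl', 'a') as f:
        f.write(json.dumps(tweet) + "\n")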

Streaming

In case we want to "keep the connection open" and gather all the upcoming tweets about a particular event, the Streaming API is what we need. We need to extend the StreamListener() to customise the way we process the incoming data. Here's a working example that gathers all the new tweets with the #python hashtag:

from tweepy import Stream
from tweepy.streaming import StreamListener

class MyListener(StreamListener):

    def on_data(self, data):
        # data is the raw JSON string of a single tweet: append it to the file
        try:
            with open('python.json', 'a') as f:
                f.write(data)
                return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        # just print the HTTP status code (e.g. 420 when we are being rate-limited)
        print(status)
        return True

twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['#python'])

Depending on the search term, we can gather tons of tweets within a few minutes. This is especially true for live events with world-wide coverage (World Cups, Super Bowls, Academy Awards, you name it), so keep an eye on the JSON file to understand how fast it grows and consider how many tweets you might need for your tests. The above script will save each tweet on a new line, so you can use the command wc -l python.json from a Unix shell to count how many tweets you've gathered.

You can see a minimal working example of the Twitter Stream API in the following Gist: twitter_stream_downloader.py (https://gist.github.com/bonzanini/af0463b927433c73784d).

Summary

We have introduced tweepy as a tool to access Twitter data in a fairly easy way with Python. There are different types of data we can collect, with the obvious focus on the “tweet” object.

Once we have collected some data, the possibilities in terms of analytics applications are endless. In the next episodes, we’ll discuss some options.

@MarcoBonzanini

167 thoughts on “Mining Twitter Data with Python (Part 1: Collecting data)”

  1. Very nice post, as usual. I am actually using part of this code to build a small application to download all the images of a twitter account. Just one small comment: for the sake of completion, I believe you should also import json in your example.

    1. Hi Bozhidar,
      the function is not defined in the code sample (hence the error); it's just a placeholder for you to personalise depending on your needs (storing the data, doing some pre-processing as described in Part 2, etc.). In its simplest form, you can just substitute it with a print() and dump all the JSON to a plain file.

      1. I tried to solve it this way and I get this error: AttributeError: 'Status' object has no attribute 'json'.
        Can you help me?

      2. I tried to dump it with print() but it didn't give me a meaningful answer. Could you show me how to store the JSON in a simple file?

  2. I am working on a project using Python & Twitter and I chanced upon your site! Really cool & very useful! My interests lie in data science too, so I'll be back :)

  3. Hi Marco,

    Really great intro. I don't want to take up your time debugging, but I was wondering if you know how to add a timer to the data collection process. I tried the following:

    if (time.time() - start_time)/60 <= 1:
        twitter_stream = Stream(auth, MyListener())
        twitter_stream.filter(track=['#python'])
    else:
        print("—collection terminated at %s seconds —") %round((time.time() - start_time), 2)
        import sys
        sys.exit()

    It runs, but doesn't stop after 1 minute as intended. Any ideas?

    1. hi basilspike,

      with the if/else described in that way, the else branch is never executed (so the script doesn’t stop).

      One option is to check the timer in the on_data() method, for example before saving/printing the tweets you could do:

      if (time.time() - self.start_time)/60 >= 1:
          return False

      You need to set self.start_time in the __init__() method, along the lines of the sketch below. Please note that in this case the script won't stop after exactly one minute, but only when the first tweet after the one-minute timeout is received (which could make a big difference for low-volume streams).
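
      An untested sketch (adapt the saving logic from the article):

      import time
      from tweepy.streaming import StreamListener

      class MyListener(StreamListener):

          def __init__(self):
              super().__init__()
              self.start_time = time.time()  # remember when streaming started

          def on_data(self, data):
              # returning False tells tweepy to close the stream,
              # so we disconnect once more than one minute has elapsed
              if (time.time() - self.start_time)/60 >= 1:
                  return False
              # ... otherwise save/print the tweet as before ...
              return True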

      Another option would be to use a subprocess for the download, and terminate it after the timeout has expired.

  4. Hey !

    This is the error I’m encountering on running the StreamListener code:

    C:\Python27\lib\site-packages\requests-2.7.0-py2.7.egg\requests\packages\urllib3\util\ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
    InsecurePlatformWarning
    C:\Python27\lib\site-packages\requests-2.7.0-py2.7.egg\requests\packages\urllib3\connection.py:251: SecurityWarning: Certificate has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
    SecurityWarning

    What does this error signify? And how can I debug this error?

    1. Hi Manas,
      probably you’ll need to upgrade to a newer Python version, please have a look at the two links in the warning message you posted (the first one in particular)

      1. Hey Marco, thanks a lot for this tutorial. How do I delete the lines from the json file after "n" minutes and keep streaming? I want to do this in time windows. For example:

        1. Start the stream.
        2. After 10 minutes, delete the lines from the json file.
        3. Start writing again the tweets.
        4. After 10 minutes, delete the lines again.
        5. Start writing again the tweets.
        6. Repeat steps 2 and 3 in a kind of loop (I don't know what to call it)

        I want to do this because I want to clean my json file after an amount of time.

        Ps: Sorry for my english. I’m just learning.

  5. Hi Catalin

    Change the following lines in streaming.py
    l161: self._buffer += self._stream.read(read_len).decode('ascii')
    l171: self._buffer += self._stream.read(self._chunk_size).decode('ascii')

    This fixes it for Python 3.4; I do not know about Python 2.7.

    1. Hi Catalin, hi che and thanks for your input.

      To expand on this, the problem was introduced with version 3.4.0 of tweepy for Python 3. Version 3.3.0 of tweepy, the one used when this article was written, is immune from this problem on Python 3.

      The issue is open on GitHub: https://github.com/tweepy/tweepy/issues/615
      The workaround suggested by che works for Python 3, but if you don't want to tamper with the libraries you could also simply downgrade tweepy, e.g.
      pip install -I tweepy==3.3.0

  6. Thanks Marco for your Twitter answer. However, I am new to Python, and I need a bit more information. For now I am trying different packages, but I have difficulties making them work. I want to ask you what I need to run code such as Tweepy. I have Anaconda, PyCharm and TextWrangler, but I cannot find the way to run the code. I have installed tweepy through the terminal, and when I try to use it in the terminal, PyCharm or Anaconda, it gives different errors. Probably because Python is not well set up, I am not using the IDE properly, or some other beginner's reason.

    1. Hi Javier, thanks for following up here.
      So from the command line, firstly check that you can invoke the Python interpreter correctly, e.g. by typing "python --version" (it should give you 3.4.3 if you have the latest Anaconda for Python 3).
      Anaconda doesn't come with a package for tweepy, so if you try "conda install tweepy" you'll see an error. You can anyway install it with "pip install tweepy==3.3.0".
      I’ve updated the post with a small example of how to use the stream API posted on github: https://gist.github.com/bonzanini/af0463b927433c73784d
      so you can just save the file and run it (assuming you have the right credentials/tokens as explained in the article)

    1. Hi, 401 is the status code for “unauthorized”. Usually this problem is related to either missing or incorrect credentials (the OAuth part at the beginning of the article), so that’s where to look first.

      1. Hey Marco, thank you for your reply.

        The problem is that when I executed the previous commands, such as user_timeline, friends and home_timeline, they worked well and gave me the right output using the same credentials. When I use the same credentials in the streaming part, it gives me a 401?

  7. Hi Marco – I'm rather new at working with APIs, and your rundown of how to access Twitter data is of great help. Thanks!

    I have one question though. It seems to me that the data is already available in JSON. By making use of the ._json key that is directly available through Tweepy, you don’t have to define and apply the parse method:

    from pprint import pprint

    for status in tweepy.Cursor(api.user_timeline).items():
        pprint(status._json)

    Correct me if I’m wrong – I’m by no means an expert in this field.

    1. Hi Nicolai, that’s correct and it’s something I have to fix in the article, it’s been in the pipeline for a while. It’s worth mentioning that the _json attribute is a dict, not the raw JSON string. Thanks for your input.

      Cheers
      Marco

  8. Hi Marco,

    Thanks for the great article. However, I see that the place field has null values. I also want to fetch the country or location from where the text is being tweeted. Please help.

    1. Hi Jigar, places and coordinates are given optionally by the users and are often omitted. Unfortunately, only a small share of tweets has these fields set explicitly. Probably you'll need to collect more data to see something interesting.
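
      If you want to keep only the geolocated tweets, a simple filter on the JSON fields would do; a sketch, assuming the tweets are stored one JSON document per line (the file name is just an example):

      import json

      with open('my_tweets.jsonl') as f:
          for line in f:
              tweet = json.loads(line)
              # 'coordinates' and 'place' are null unless the user shared them
              if tweet.get('coordinates') or tweet.get('place'):
                  print(tweet['text'])
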
      Cheers,
      Marco

  9. Hi Marco
    When I wrote the first command "import tweepy" in Python, I got the error
    Traceback (most recent call last):
    File "", line 1, in
    ImportError: No module named 'tweepy'

    I upgraded to Python 3 and it still gives the same error, any clue why this happens?
    Thanks
    Deena

    1. This is what I got when I was downloading tweepy through sudo pip install tweepy:
      Downloading/unpacking tweepy
      Downloading tweepy-3.4.0.tar.gz
      Running setup.py egg_info for package tweepy
      Traceback (most recent call last):
      File "", line 14, in
      File "/Users/deena/build/tweepy/setup.py", line 17, in
      install_reqs = parse_requirements('requirements.txt', session=uuid.uuid1())
      TypeError: parse_requirements() got an unexpected keyword argument 'session'
      Complete output from command python setup.py egg_info:
      Traceback (most recent call last):
      File "", line 14, in
      File "/Users/deena/build/tweepy/setup.py", line 17, in
      install_reqs = parse_requirements('requirements.txt', session=uuid.uuid1())
      TypeError: parse_requirements() got an unexpected keyword argument 'session'

  10. Hi Marco
    I tried to run the command import tweepy in Python 2.7 but it gave me the error
    Traceback (most recent call last):
    File "", line 1, in
    ImportError: No module named 'tweepy'
    I upgraded to Python 3, still the same error. Any clue why this happens?
    Thanks
    Deena

    1. Hi Deena, the examples are tested on Python 3. You should be able to install the correct version of Tweepy with:
      pip install tweepy==3.3.0
      (I've updated the article with an explanation of the version numbers.)
      In general it's more sensible to use a virtualenv rather than sudo.

  11. Hello everyone, I have to collect tweets for a certain period (1/2015–9/2015) for my thesis. Reading the article, I started to register and create a new app, but I have no website and the field is required. Any suggestions?

    thank you in advance

    1. Hi Christina, the website is required but you can put a placeholder (as suggested also by Twitter). For example your Twitter handle, or your GitHub page, will do.

  12. Marco, another question:

    I tried to install tweepy, but I got stuck in cmd: when I type python setup.py install I get this: AttributeError: 'str' object has no attribute 'req'.

    1. Hi, I’d recommend to use pip with a virtualenv to install the libraries, rather than using the bleeding-edge version from github — and keep in mind that the code in these articles has been tested on Python 3.4

      Cheers, Marco

      1. Thank you very much Marco, your answers are very helpful. Another, maybe silly, question: can I run the code with Python IDLE? I'm at the point where I start to access the data with the OAuth interface.

  13. Hello everyone! I would like to ask why, when retrieving tweets, I get some repeated ones. It's like the API returns a tweet more than once. How can I fix that?

    thank you in advance

    1. Hi, on_data() is the entry point for any sort of data received from the stream, while on_status() is specific to statuses. You can implement the on_status() method directly if you prefer, along the lines of the sketch below.
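
      A quick (untested) sketch:

      from tweepy.streaming import StreamListener

      class MyListener(StreamListener):

          def on_status(self, status):
              # status is already a parsed Status object, not a raw JSON string
              print(status.text)
              return True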

      Cheers,
      Marco

  14. How would you store all your tweets in a separate json file instead of just printing them all out like you show in the process_or_store function? I see in later posts you reference the most common words in your tweets and I wanted to know how to do that.

    1. Hi Aidan, similarly to the on_data() method in the streaming example, you can open the file and then just dump the JSON into it, something like:

      with open('my_tweets.jsonl', 'w') as f:
          for tweet in tweepy.Cursor(api.user_timeline).items():
              f.write(json.dumps(tweet._json)+"\n")
      

      (even without defining a custom process_or_store() here)
      Cheers,
      Marco

      1. Hey thanks for the help!
        When I run your suggested code I'm getting an error:
        TypeError: unsupported operand type(s) for +: 'dict' and 'str'

        Something to do with the +"\n" part, I think, but I can't figure out how to make it work and an internet search hasn't helped much.
        Any further advice you can give me would be hugely appreciated

    2. I’ve updated the example: tweet._json is in fact a dictionary (loaded from the original json string, but still not a string), so you need json.dumps() first
      Cheers,
      Marco

      1. Thank you for the help.
        If I were trying to do the same thing but get all the tweets I have favorited into a JSON file, would the code structure be the same?
        I tried with api.favorites instead of api.user_timeline, but I get error code 429, which means too many requests.
        I guess what I'm asking is: is there a way to get around the rate limit in this case, and what would you suggest?

        Thanks again,
        Aidan

  15. Marco,

    Great page. Very helpful. I am a beginner who is attempting to use your guide to locate the geographical origin of certain hashtags.

    For the process_or_store() function, how do I tweak its definition so that it saves the stream of tweets I collect to a json file (such as the 'mytweets' one you refer to in the second part of this guide)?

    Thanks in advance.

    1. Hi Michael, thanks for the comment.

      When downloading your tweets, you can for example do something like:

      with open('my_tweets.jsonl', 'w') as f:
          for tweet in tweepy.Cursor(api.user_timeline).items():
              f.write(json.dumps(tweet._json)+"\n")

      (here you don’t even need a custom process_or_store() )

      For the streaming part, for example to download tweets with a given hashtag, the MyListener class defined in the article already stores the tweets on a file.

      Also, please have a look at this: https://marcobonzanini.com/2016/08/02/mastering-social-media-mining-with-python/ (chapter 2, about mining Twitter, is available as a free sample from the publisher’s web site, and more sample code is available on my Github).

      Cheers,
      Marco

  16. Marco,

    Is there an easy way to put a time limit on the stream of data?

    I was thinking of something like
    import time
    start_time = time.time()
    end_time = start_time + 10

    if end_time < time.time():
        sys.exit()

    So my code looks like this:

    import time

    ..

    start_time = time.time()
    end_time = start_time + 10

    from tweepy import Stream
    from tweepy.streaming import StreamListener

    class MyListener(StreamListener):

        def on_data(self, data):
            start = time.time()
            try:
                with open('fresh2.json', 'a') as f:
                    f.write(data)
                    return True
            except BaseException as e:
                print('Error on_data', str(e))
                return True
            finally:
                if end_time < time.time():
                    sys.exit()

    But this does not work. Any ideas? Thanks in advance. Excellent page.

    1. Hi Peter,
      off the top of my head, the simplest option if you want to achieve this programmatically is to set the desired end time in the constructor of the custom listener, and then check the timing when you receive the data, e.g.

      class MyListener(StreamListener):
          def __init__(self):
              self.end_time = time.time() + 10

          def on_data(self, data):
              if time.time() > self.end_time:
                  sys.exit()
              # plus the rest of the class definition
      

      (I haven't tested this code, it might need some tweaking.)
      One problem with this approach is that the check is triggered only when you receive data, so for a very low-volume stream you might keep the connection open for more than 10 seconds.

      Another option is to use some OS facility, e.g. Linux has a timeout command (also available on Mac via "brew install coreutils", in this case called gtimeout):

      $ gtimeout 10s python my_file.py

      this will run the python script and kill it after 10 seconds.

      Cheers,
      Marco

  17. Hi Marco

    I bought your “Mastering Social Media Mining with Python” book, and I’m trying to get the accompanying code on Github to work. I’m at twitter_streaming, which I haven’t been able to get working.

    Following the instructions in the book, I did the following
    – created a virtual environment
    – set the four parameters (consumer key and secret, access token and secret) as environment variables
    – ran the twitter_client.py script (for authentication) from windows cmd prompt as follows
    $ python twitter_client.py
    – ran the twitter_streaming.py script from windows cmd prompt with the keywords as follows
    $ python twitter_streaming.py keyword1 keyword2 keyword3

    I'm getting 401 errors when running twitter_streaming.py. I tried to abort the execution using Ctrl+C but with no success: I'm unable to abort the execution and I keep getting a continuous stream of 401 errors.

    My questions
    – how to make sure that twitter_client.py was successfully executed?
    – why do you think I'm getting 401 errors when executing twitter_streaming.py? And how can I abort the execution to get back to the prompt?
    – is there a way to create the virtual environment and set the virtual environment variables and execute the accompanying scripts from inside the python interpreter rather than from the cmd prompt?

    Note:
    – I’ve been using python for about a year now, but still consider myself a beginner.

    Thank you in advance
    Ahmed

  18. Hi Ahmed

    error 401 is usually given because the credentials are incorrect, so possibly a copy-paste mismatch? If the variables are not set at all for the current session, the script would raise an error and quit.

    You don't need to call twitter_client.py explicitly, because it's used by the other scripts to set up the authentication. If you set the environment variables from the command line, keep in mind that these variables are scoped to your existing session, i.e. when you close the console window they are not kept. I usually put all the environment-setting commands in a single shell script that I run once per session (I also make sure that it's ignored by the source control and not checked in). You can check the value of these variables using "echo %VAR_NAME%", for example echo %TWITTER_CONSUMER_KEY%.
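
    To double-check from within Python, you can read the variables with os.environ, e.g. (a minimal sketch; the variable name follows the echo example above):

    import os

    # os.environ.get() returns None if a variable is not set,
    # which would explain the 401 errors
    consumer_key = os.environ.get('TWITTER_CONSUMER_KEY')
    print(consumer_key)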

    If you’re not comfortable with defining the environment in this way, you can still hard-code the values, e.g. in twitter_client.py you can replace the get_twitter_auth() function:

    def get_twitter_auth():
        consumer_key = 'YOUR-CONSUMER-KEY-HERE'
        consumer_secret = 'YOUR-CONSUMER-SECRET-HERE'
        access_token = 'YOUR-ACCESS-TOKEN-HERE'
        access_secret = 'YOUR-ACCESS-SECRET-HERE'
        auth = OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_token, access_secret)
        return auth

    This is not a great design, as explained in Chapter 2 (separation of concerns between app logic and config), but it's good enough to get you started; just make sure that you don't accidentally push your personal keys to GitHub :)
    Cheers,
    Marco

    1. Hi
      JSON can represent a nested structure, while CSV can only represent a flat record-like one, so in the general case you can't directly map JSON to CSV. You first need to find a way to normalise the JSON data. Once you ensure your data structure is flat, you can use the Python csv module (csv.writer in particular) to produce the CSV output, along the lines of the sketch below.
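
      A minimal sketch that keeps just a few flat fields of each tweet (the field choice and file names are only examples):

      import csv
      import json

      fields = ['id_str', 'created_at', 'text']  # arbitrary flat fields

      with open('tweets.jsonl') as fin, open('tweets.csv', 'w', newline='') as fout:
          writer = csv.writer(fout)
          writer.writerow(fields)  # header row
          for line in fin:
              tweet = json.loads(line)
              writer.writerow([tweet.get(f, '') for f in fields])
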
      Cheers,
      Marco

  19. I am trying to gather all the tweets about a particular event through the Streaming API. I am able to get data for #apple, but when I try to stream data for #AirbnbWhileBlack, I get 1 tweet per 30 minutes.

    I have tried registering a new app on the Twitter website, but I'm still facing the same issue. I have searched on Google but didn't find any solution to the problem. Does anyone have any idea how I can resolve this? Or any website through which I can collect data on #AirbnbWhileBlack?

    1. Hi nayansinghal,
      from what I can see, that hashtag has a very low frequency at the moment, i.e. only a few tweets per day, so what you're observing is in fact correct. You can look into the Twitter Search API (also supported by tweepy) rather than the streaming used here, so you can go back to approximately the last week (but some tweets/users might be missing from the results, as explained in the documentation):
      https://dev.twitter.com/rest/public/search
      http://docs.tweepy.org/en/v3.5.0/api.html#API.search
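
      With tweepy, a quick sketch of a search would look something like:

      # collect what the Search API returns for the hashtag
      for tweet in tweepy.Cursor(api.search, q='#AirbnbWhileBlack').items():
          process_or_store(tweet._json)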

      Cheers,
      Marco

      1. Thanks Marco for your response. I checked the Twitter Search API and got only 70 tweets overall. I searched on Google and figured out that these APIs only provide tweets from the past 7 days. Is there any other method through which I can get tweets older than 1 week?

        I know one solution is to extract tweets using a web page scraper, but that doesn't seem like a good idea to me.

    1. Hi,
      you need to check the correct indentation of the code as shown in the article (sometimes with copy&paste the indentation gets lost)
      Cheers,
      Marco

  20. The documentation says the return type of the search method is a list of SearchResults; how can I extract the tweets from this?

  21. I'm interested in getting the tweets of somebody's followers, but I only get up to 20 followers, even though this person has more. Why does this happen?

    1. Hi, unfortunately the Twitter Search API only lets you go back in time to about two weeks, older tweets are not available.
      Cheers,
      Marco

    1. Hi, I suggest you do some post-processing after the streaming is closed.
      For example:

      import json
      
      MIN_RT = 5
      
      with open('tweets.json', 'r') as fin, open('tweets_filtered.json', 'w') as fout:
          for line in fin:
              tweet = json.loads(line)
              if tweet['retweet_count'] >= MIN_RT:
                  fout.write(line)

      This will read your tweets.json file and create a tweets_filtered.json file with only the tweets that have been retweeted at least 5 times.
      Cheers,
      Marco

  22. Good afternoon, I'm a PhD student and I have a question. Would it be possible to get user data within a "zone", i.e. for all the users that use Twitter within that zone? Thank you very much in advance.
