Mastering Social Media Mining with Python


Great news, my book on data mining for social media is finally out!

The title is Mastering Social Media Mining with Python. I’ve been working with Packt Publishing over the past few months, and the book was finalised and released in July.


As part of Packt’s Mastering series, the book assumes that readers already have a basic understanding of Python (e.g. for loops and classes), but more advanced concepts are discussed with examples. No particular experience with social media APIs or data mining is required. Over its 300+ pages, the book aims to leave readers able to build their own data mining projects using social media data and Python tools.

A bird’s eye view of the content:

  1. Social Media, Social Data and Python
    • Introduction to Social Media and Social Data: challenges and opportunities
    • Introduction to Python tools for Data Science
    • Overview of the use of public APIs to interact with social media platforms
  2. #MiningTwitter: Hashtags, Topics and Time Series
    • Interacting with the Twitter API in Python
    • Twitter data: the anatomy of a tweet
    • Entity analysis, text analysis, time series analysis on tweets
  3. Users, Followers, and Communities on Twitter
    • Analysing who follows whom
    • Mining your followers
    • Mining communities
    • Visualising tweets on a map
  4. Posts, Pages and User Interactions on Facebook
    • Interacting with the Facebook Graph API in Python
    • Mining your posts
    • Mining Facebook Pages
  5. Topic analysis on Google Plus
    • Interacting with the Google Plus API in Python
    • Finding people and pages on G+
    • Analysis of notes and activities on G+
  6. Questions and Answers on Stack Exchange
    • Interacting with the StackOverflow API in Python
    • Text classification for question tags
  7. Blogs, RSS, Wikipedia, and Natural Language Processing
    • Blogs and web pages as social data
    • Web scraping with Python
    • Basics of text analytics on blog posts
    • Information extraction from text
  8. Mining All the Data!
    • Interacting with many other APIs and types of objects
    • Examples of interaction with YouTube, Yelp and GitHub
  9. Linked Data and the Semantic Web
    • The Web as Social Media
    • Mining relations from DBpedia
    • Mining geo coordinates

The detailed table of contents is shown on Packt Pub’s page. Chapter 2 is also offered as a free sample.

Please have a look at the companion code for the book on my GitHub, so you can have an idea of the applications discussed in the book.

22 thoughts on “Mastering Social Media Mining with Python”

  1. I bought your book to help me learn about the Twitter API. The tip about the JSON Lines format (.jsonl) has already cleared up a lot of confusion I had about storing and retrieving tweets. Thanks a lot. By the way, the code I downloaded had zoom_start set to 17, which is too high to see both London and Paris in the same window. The book has this value at 5, which is much better.
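For readers unfamiliar with the JSON Lines format mentioned in this comment, here is a minimal sketch of the idea: each record is written as one JSON document on its own line, so a file can be appended to and read back one tweet at a time. The helper names and the sample records are hypothetical and are not taken from the book.

```python
import json
import os
import tempfile


def append_jsonl(path, record):
    """Append one record to the file as a single line of JSON."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Yield one decoded record per non-empty line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical sample records standing in for real tweets
sample_tweets = [
    {"id": 1, "text": "Hello from London"},
    {"id": 2, "text": "Bonjour de Paris"},
]

# Write to a temporary folder so the sketch leaves no files behind
jsonl_path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")
for tweet in sample_tweets:
    append_jsonl(jsonl_path, tweet)

loaded_tweets = list(read_jsonl(jsonl_path))
```

Because every line is a complete JSON document, a partially written file can still be read up to the last complete line, which suits long-running collection scripts.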

    1. Hi Alex,
      that excessive zoom was probably me fooling around with the examples, but as you noticed the book has the correct, intended value. I’ve reverted to the right value in the GitHub repo. Many thanks for reporting it!

  2. Hi Marco, I’m running a small lab in Soweto teaching young people to code for free. Can we use the code from GitHub as class material without purchasing the book, as we have no budget?
    We love your work, keep going.

  3. Hi Marco!

    I’m working with python 3.5.0.
    This snippet appears in chapter 1, section 3

    >>> from nltk.tokenize import TweetTokenizer
    >>> tokenizer = TwitterTokenizer()
    >>> tweet = '@marcobonzanini: an example! :D #NLP'
    >>> print(tokenizer.tokenize(tweet))
    # ['@marcobonzanini', ':', 'an', 'example', '!', ':D', '', '#NLP']

    the second line:

    >>> tokenizer = TwitterTokenizer()

    returns this:

    Traceback (most recent call last):
    File "", line 1, in
    NameError: name 'TwitterTokenizer' is not defined

    I’m guessing that TwitterTokenizer is no longer provided by nltk?

    Where could I go to for support on this issue?

    1. Hi Bob, this will go in the “errata” section
      The correct line should be:

      tokenizer = TweetTokenizer()

      (i.e. “Tweet” instead of “Twitter”, like the class just imported in the first line).
      Thanks for reporting this.

  4. Marco, thanks for publishing this book. I do have one problem, though it is not with your book: Twitter seems to be rejecting OAuth authentication calls from Tweepy. I have quadruple-checked my code and my access keys, but I still get error 215, “Bad Authentication Data”. I also tried Twython and got the same result. Judging by Stack Overflow comments, it seems to be an ongoing problem with Twitter. Do you know anything about this? Thanks!


  5. Hi Lorne,
    error 215 Bad Authentication Data happens when you don’t authenticate or when you send empty authentication, so assuming you’re following all the right steps to set up the app, I’m not sure where to look. I’ve tested the code again exactly as it appears in the book, with old and new apps (e.g. re-using existing access keys and getting new ones), and everything runs smoothly.
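Since empty authentication is one cause of this error, one way to rule it out is to check that all four credentials are present and non-blank before building the OAuth handler. The sketch below assumes the credentials live in environment variables with the names shown, and the helper names are hypothetical. It uses Tweepy's OAuthHandler and set_access_token, which recent Tweepy releases keep only as a deprecated alias, so adjust for your installed version.

```python
import os

# Assumed variable names; use whatever names your own setup defines
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials, failing loudly if any is missing or blank."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
    if missing:
        raise RuntimeError("Missing or empty credentials: " + ", ".join(missing))
    return {key: env[key].strip() for key in REQUIRED_KEYS}


def build_tweepy_auth(credentials):
    """Build a Tweepy OAuth handler (requires the tweepy package)."""
    # Imported lazily so the credential check above works without tweepy
    from tweepy import OAuthHandler

    auth = OAuthHandler(credentials["TWITTER_CONSUMER_KEY"],
                        credentials["TWITTER_CONSUMER_SECRET"])
    auth.set_access_token(credentials["TWITTER_ACCESS_TOKEN"],
                          credentials["TWITTER_ACCESS_SECRET"])
    return auth
```

Calling build_tweepy_auth(load_twitter_credentials()) then raises a readable error locally, before any request is sent with blank values.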


    1. Thank you, Marco. I will take another look at my code. I copied it straight from the book. But, I could have made an error that I didn’t catch the first five times I looked. I am sorry to bother you with this.

      Warmest regards,




  6. Sir, excellent book. My humble suggestion is that in future you consider writing a book about text analysis. I am doing a master’s in computational engineering, and my project involves collecting data for sentiment analysis. Since I am from India, a lot of the text (comments or tweets) that I collected from Twitter and Facebook is in transliterated or code-mixed form. This is really challenging because there is no tool or library that addresses the morphological complexity of code-mixed text. I need a suggestion on how to deal with text of such a complex nature. How can I classify sentiment for this kind of text?
    I would be happy to share some of the complexities involved in my project, and my research interests, with you personally by email.

    Thank you.


  7. Dear Marco, thank you for the inspiring book. I am having a hard time with a Twitter error response: status code = 400. I think I have looked everywhere, online and offline, for a solution without any result. I would really appreciate it if you could help me.


  8. Dear Marco, I too wanted to thank you for this excellent book. Two remarks from my side:
    – in the Chap04 example, you should replace = graph.get_connections('PacktPub',
    in order to collect posts from the right page (and not PacktPub…)
    – do you plan to develop a LinkedIn section in the future? This is really missing.
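A sketch of the first remark, assuming the goal is to pass the page name as a parameter rather than hard-coding it: the function below takes any object exposing the facebook-sdk style get_connections(id, connection_name, **args) method. The function name and default limit are hypothetical, and this is not the fix as committed in the book's repository.

```python
def get_page_posts(graph, page_id, limit=10):
    """Fetch recent posts for `page_id` rather than a hard-coded page name.

    `graph` is expected to behave like facebook.GraphAPI, e.g.
    graph = facebook.GraphAPI(access_token) from the facebook-sdk package.
    """
    response = graph.get_connections(page_id, "posts", limit=limit)
    # The Graph API wraps results in a dict; the posts sit under "data"
    return response.get("data", [])
```

Keeping the page name as an argument means the same script can be pointed at any page from the command line without editing the source.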


    1. Thanks for the kind words. To answer your suggestions:
      – that problem was documented and fixed in the GitHub repo a while ago (thanks to a previous suggestion from a translator); I use the GitHub repo to keep track of corrections if anything pops up (thanks for the suggestion anyway!)
      – the problem with unicode is related to using Python 2. My suggestion is to upgrade to Python 3 as soon as possible because Python 2 is getting close to its end-of-life date; the code from the book is written in Python 3, and doesn’t support Python 2 explicitly although many snippets can work without too much trouble. Amongst other reasons for upgrading to Python 3, unicode is much less painful in Py3 than it is in Py2. If you’re stuck with Python 2, your suggestion does indeed help, but my recommendation is to upgrade asap :)
      – I’m not planning any new edition at the moment. I have to say that originally I was planning to include LinkedIn when I was thinking about the book outline a few years ago, but then their API became very limited for public use, i.e. you have to pay to get access to the most interesting analytics features, and the public access is not very interesting, so I preferred to focus on other tools that people can openly use without paying a fee.

      Thanks again for the nice words



  9. Also, if you plan to display messages with Unicode characters (for French speakers, for example), you should recommend adding from __future__ import unicode_literals at the beginning of
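To illustrate the Unicode point in this thread: in Python 3 every string literal is already Unicode, so the __future__ import is only needed on Python 2. A small sketch of the Python 3 behaviour, using a made-up sample message:

```python
import json

# A made-up message containing a non-ASCII character
message = "café"

# In Python 3, len() counts characters, not bytes
char_count = len(message)
byte_count = len(message.encode("utf-8"))

# By default json.dumps escapes non-ASCII characters...
escaped_line = json.dumps({"text": message})
# ...while ensure_ascii=False keeps them readable in the output
readable_line = json.dumps({"text": message}, ensure_ascii=False)
```

When writing readable_line to disk, open the file with encoding="utf-8" so the result does not depend on the platform's default encoding.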


  10. Dear Marco, I have a quick question: in your book you have the example of streaming with the following command
    python \#RWC2015 \#RWCFinal rugby
    I would like to know how to save the file and where it goes, because when I run that command on my computer it keeps running without producing any output, and I have to stop it with Ctrl+C.

    thank you very much


    1. Hi Raul, the output is stored in a file called stream_[your query].jsonl, where [your query] is the set of keywords or hashtags that you track, in this case “stream__RWC2015__RWCFinal_Rugby.jsonl” (notice that the # symbol and other special characters are converted to underscores). The file will be in the same folder as the script, and it is created only when you receive the first tweet.
      If you’re tracking #RWC2015 today, you probably won’t see any new tweets coming through (that’s why no file was being saved), so I’d suggest you test the script with other keywords that are not related to past events like the 2015 Rugby World Cup.

      Best regards
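The naming rule described in this reply can be sketched as follows. This is an illustration under stated assumptions, not necessarily the book's exact code: the function name is hypothetical, and the set of characters kept as-is (letters, digits, hyphen, underscore and dot) is an assumption.

```python
import string

# Assumed whitelist; every other character becomes an underscore
VALID_CHARS = set("-_." + string.ascii_letters + string.digits)


def stream_filename(query_terms):
    """Build a file name from the tracked terms, replacing unsafe characters."""
    joined = " ".join(query_terms)
    safe = "".join(ch if ch in VALID_CHARS else "_" for ch in joined)
    return "stream_{}.jsonl".format(safe)


example_name = stream_filename(["#RWC2015", "#RWCFinal", "rugby"])
# example_name == "stream__RWC2015__RWCFinal_rugby.jsonl"
```

Both the # symbols and the spaces between terms turn into underscores, which is why two underscores appear in a row before each hashtag.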


Comments are closed.