Mastering Social Media Mining with Python


Great news, my book on data mining for social media is finally out!

The title is Mastering Social Media Mining with Python. I’ve been working with Packt Publishing over the past few months, and the book was finalised and released in July.


As part of Packt’s Mastering series, the book assumes that readers already have a basic understanding of Python (e.g. for loops and classes), while more advanced concepts are discussed with examples. No particular experience with Social Media APIs or Data Mining is required. Across its 300+ pages, the book aims to get readers to the point where they can build their own data mining projects using data from social media and Python tools.

A bird’s eye view on the content:

  1. Social Media, Social Data and Python
    • Introduction on Social Media and Social Data: challenges and opportunities
    • Introduction on Python tools for Data Science
    • Overview on the use of public APIs to interact with social media platforms
  2. #MiningTwitter: Hashtags, Topics and Time Series
    • Interacting with the Twitter API in Python
    • Twitter data: the anatomy of a tweet
    • Entity analysis, text analysis, time series analysis on tweets
  3. Users, Followers, and Communities on Twitter
    • Analysing who follows whom
    • Mining your followers
    • Mining communities
    • Visualising tweets on a map
  4. Posts, Pages and User Interactions on Facebook
    • Interacting with the Facebook Graph API in Python
    • Mining your posts
    • Mining Facebook Pages
  5. Topic analysis on Google Plus
    • Interacting with the Google Plus API in Python
    • Finding people and pages on G+
    • Analysis of notes and activities on G+
  6. Questions and Answers on Stack Exchange
    • Interacting with the StackOverflow API in Python
    • Text classification for question tags
  7. Blogs, RSS, Wikipedia, and Natural Language Processing
    • Blogs and web pages as social data
    • Web scraping with Python
    • Basics of text analytics on blog posts
    • Information extraction from text
  8. Mining All the Data!
    • Interacting with many other APIs and types of objects
    • Examples of interaction with YouTube, Yelp and GitHub
  9. Linked Data and the Semantic Web
    • The Web as Social Media
    • Mining relations from DBpedia
    • Mining geo coordinates

The detailed table of contents is shown on Packt Pub’s page. Chapter 2 is also offered as a free sample.

Please have a look at the companion code for the book on my GitHub, so you can have an idea of the applications discussed in the book.

PyData London 2016 write-up

Last weekend I was at the PyData London conference for three Pythonic days. Firstly, thanks to the organisers, volunteers, speakers, sponsors and everyone who has contributed in one way or another to make the event a great success.

This year I had the opportunity to contribute as a member of the review committee, which means I had a glimpse behind the scenes and I know how many great proposals we had. With three days and three to four tracks running in parallel, there is room for a lot of Pythonic parley, yet unfortunately many good proposals had to be turned down due to time/space constraints. The programme turned out to be great nevertheless.

The three days were really intense so there is just too much to say, but I’ll try to summarise some of the take-home messages.

Tutorials: delivering a tutorial is difficult. Everything that could go wrong, will go wrong (big screen that goes bananas for 10 minutes, flaky Internet connection so a conda install takes ages, you name it). Jupyter notebook makes life better, but I strongly feel for the speakers, so a big thank you for taking the time to prepare some quality material.

Topics of interest: some topics seemed to capture most of the attention this year. In particular, there was a lot of interest around data pipelines, deep learning and Bayesian stats. Unsurprising?

Keynotes: following the recent news on the LIGO project, Prof. Andreas Freise gave an introduction to gravitational waves, lasers, the latest achievements in physics and other cool things far beyond my understanding. Something I could understand and relate to is the way he described how he needs to write code to do his job, even though writing code is not his main job. This is true for many academics and researchers without a software engineering background, who were also the main audience of my talk on building data pipelines (luckily enough, scheduled right after the keynote in the same room).

The second keynote, given by Tetiana Ivanova, was about the beginning of her journey in Data Science without formal education. Some of the suggestions were sensible; in fact, I recently shared some of the same ideas in a short talk to UCL students and post-docs who want to move to industry.

The third and last keynote was given by Travis Oliphant: CEO of Continuum Analytics, author of NumPy, creator of SciPy, Pythonista since the late 1990’s. His talk was about scaling up and scaling out the PyData stack. Things to watch out for: Numba and Dask. Really exciting stuff going on!

My talk: I presented “Building Data Pipelines in Python”, with a focus on the need to bring R&D and Engineering together, and how basic engineering principles can be beneficial even if your job is not all about writing code. After presenting a very similar talk at PyCon Italy, I found the audience in London to be a bit more on the academic side than I initially thought, which was perfect for my engineering rants. After the usual first few minutes of feeling awkward when speaking publicly, I started my discussion on unit testing and asked how many in the audience write unit tests regularly. Random guy from the audience: “What’s a unit test?”. Thank you kind stranger, you lifted my spirit and the rest of the talk was a breeze.

The slides of my talk are on my speakerdeck.

Last year it took several months to get the videos out, this year only one day! So this is the video of my talk: https://www.youtube.com/watch?v=7NzH1Gx8-4E

I had some interesting questions after the talk and I also had some nice conversations the day after. Apparently, I raised some interest in Luigi: a few people told me that, after listening to my talk, they really had to attend the other talk about using Luigi in production, delivered by Pete Owlett from Deliveroo (the room was overflowing so I couldn’t even get close!). There was also some genuine interest in unit testing, and a very interesting question was how to apply it when working with Jupyter notebooks.

Lightning talks: apparently, saving your Jupyter notebooks on git is an issue that is taken very seriously by the community. In fact, three speakers came up with different solutions for the same problem.

Organisation: hats off to the organisers and everyone involved, and see you at the PyData London meetup!

Get in touch if you also have a write up of the event:

@MarcoBonzanini

PyCon Italia / PyData Italy 2016 Write-Up

Last week I travelled to Florence to attend PyCon Sette, the seventh edition of the Italian Python Conference, born 10 years ago and held annually (with three editions of EuroPython in between).

First off, I have something to admit: as this was my first time at PyCon Italia, clearly I didn’t know what I was missing. I’ve been overly busy with work and side projects, so this is the perfect excuse to resume the blog.

Florence

The city doesn’t need much presentation: it’s simply one of the most beautiful cities in the world. I hadn’t been there for a few years, but things don’t seem to be very different from a tourist’s point of view. The craft beer scene is booming, but at the same time culinary traditions are well preserved. Both of these are big thumbs-up for me. The best random moment of my trip: getting lost in the back streets of the old city centre, and then finding a dodgy hole-in-the-wall place that sells incredible focaccia and panini.

The Conference

PyCon Sette can be summarised as three intense days of Python, with more than 500 attendees. The first day was opened by Alex Martelli with a keynote about exception handling in Python 2 vs Python 3. Apart from the keynotes, at any given time we had between 4 and 6 parallel sessions of talks or trainings. I decided to stick to the PyData track for the whole time, although the other tracks also featured some interesting talks. Some of the tracks were related to a particular sub-community, with PyData and DjangoVillage having a strong presence, but Odoo, DjangoGirls and the Italian Postgres User Group are also worth mentioning.

I listened to many interesting talks. Off the top of my head, a few to remember: the talk about Internet of Things by Stefano Terna of TomorrowData.io (also winners of the start-up contest), the one about deployment of scikit-learn models in the cloud by Alex Casalboni and an interesting one about Functional Programming and Dask by Holger Peters.

Overall, hats off to the organisers. In particular, I had some conversations with Valerio Maggio who is the founder of PyData Italy. We exchanged some opinions about the conference and the community in the broader sense. Hopefully the interest around Data Science in Italy will keep rising, so maybe several local events throughout the year will be held, rather than having just one big national event per year.

My Talk

On Saturday, I gave a talk on Building Data Pipelines in Python. I wrote about building data pipelines with Luigi before, but this talk gave me the opportunity to look at the bigger picture. The general message was that Research and Engineering are different disciplines, but we (data-sciency and researchy people) can benefit from trying to meet in the middle. In particular, good engineering practices can help the less engineering-oriented researchers in their day-to-day mundane tasks. After opening the discussion on the overall topic, I had a brief moment of ranting about unit testing (or the lack of testing culture in some academic circles), I introduced Luigi as a workflow manager to build pipelines in Python and I closed with an overview on logging (described by Alex Martelli in his keynote as something that scares people off, at least initially) and a consideration about using good engineering practices in research.

The talk was addressed to beginners and to the less engineering-savvy PyData users, so expert software engineers probably didn’t benefit much from it. I had a good response anyway, with several people coming up after the talk for a chat. All in all, if at least one researcher looks into testing or decides to try one of the workflow managers I mentioned, I’d say I’ve reached my goal.

The slides of my talk are on my speakerdeck (videos will be on-line soon).

See you next year in Florence!

Retrocomputing and Python: import turtle

My first experience with something related to programming was back in middle school. From time to time, our Math-and-Science substitute teacher used to walk us to the computer room, which was full of shiny Commodore 64 machines, where we had a lot of fun (sort of) with a graphic tool called turtle. What we were trying to do was simply to give a list of instructions to a turtle-shaped cursor, so it could move on the screen and draw some colourful shapes.

Back in those days, we didn’t even realise that we were doing something programming-related, we simply thought we were skipping Math for one day. Fast-forward a few years, I found out about Logo, its value as an educational programming language and Turtle graphics as one of Logo’s key features.

Given the festive spirit of these days, I thought I’d give a shot at the turtle package, part of the Python standard library ;)

Quick Intro on Turtle Graphics

Python has its own implementation of the turtle as part of the standard library (see documentation here). It uses the tkinter module for the underlying graphics, so it has to be run with a version of Python with Tk support.

If you’ve never heard of Turtle Graphics, these are some of the core concepts:

  • The turtle has a position (x, y coordinates) and an orientation
  • The orientation can be changed with right/left commands, e.g. right(90) will rotate 90 degrees clockwise
  • The position can be changed with forward/backward commands, or by setting the coordinates explicitly
  • The turtle is also called pen: when the pen is down, moving the turtle will draw a line
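
These concepts can be tried out without opening a window. What follows is a minimal sketch in plain Python: the ToyTurtle class is hypothetical and not part of the turtle module. It only tracks position, heading and pen state, and records the line segments that would have been drawn.

```python
import math


class ToyTurtle:
    """Hypothetical stand-in for turtle.Turtle: tracks state, draws nothing."""

    def __init__(self):
        self.x, self.y = 0.0, 0.0   # position
        self.heading = 0.0          # degrees; 0 points right (3 o'clock)
        self.pen_is_down = True
        self.segments = []          # lines that would have been drawn

    def left(self, angle):
        self.heading = (self.heading + angle) % 360

    def right(self, angle):
        self.heading = (self.heading - angle) % 360

    def forward(self, distance):
        rad = math.radians(self.heading)
        new_x = self.x + distance * math.cos(rad)
        new_y = self.y + distance * math.sin(rad)
        if self.pen_is_down:
            self.segments.append(((self.x, self.y), (new_x, new_y)))
        self.x, self.y = new_x, new_y

    def penup(self):
        self.pen_is_down = False

    def pendown(self):
        self.pen_is_down = True


t = ToyTurtle()
t.forward(100)
t.left(90)
t.forward(60)
t.penup()
t.forward(10)   # moves without drawing
final_position = (round(t.x, 6), round(t.y, 6))
```

Only the first two moves add a segment, because the pen is lifted before the last one.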

import turtle

The starting point is simply to import the turtle module. A turtle program will have a turtle.Screen object as a drawing canvas, and a turtle.Turtle object as a pen.

Let’s consider this first example:

import turtle

if __name__ == '__main__':
    win = turtle.Screen()

    turt = turtle.Turtle()
    turt.forward(100)
    turt.left(90)
    turt.forward(30)
    turt.color("red")
    turt.forward(30)

    win.mainloop()

This will produce the following:

Turtle Example

The turtle is initially oriented towards the right-hand side of the screen, i.e. towards 3 o’clock. Moving forward will produce the initial black line. As you can see the colour can be changed later using the turtle.color() function.

Festive Turtle

This section shows a more complex example. The full code is available on GitHub.

import turtle

if __name__ == '__main__':
    wn = turtle.Screen()

    my_turtle = turtle.Turtle()

    # start drawing the tree
    my_turtle.color("darkgreen")
    my_turtle.pensize(5)
    my_turtle.begin_fill()
    # the right half of the tree
    my_turtle.forward(100)
    my_turtle.left(150)
    my_turtle.forward(90)
    my_turtle.right(150)
    my_turtle.forward(60)
    my_turtle.left(150)
    my_turtle.forward(60)
    my_turtle.right(150)
    my_turtle.forward(40)
    my_turtle.left(150)
    my_turtle.forward(100)
    # the left half of the tree
    my_turtle.left(60)
    my_turtle.forward(100)
    my_turtle.left(150)
    my_turtle.forward(40)
    my_turtle.right(150)
    my_turtle.forward(60)
    my_turtle.left(150)
    my_turtle.forward(60)
    my_turtle.right(150)
    my_turtle.forward(90)
    my_turtle.left(150)
    my_turtle.forward(133)

    my_turtle.end_fill()
    # the trunk
    my_turtle.color("brown")
    my_turtle.pensize(1)
    my_turtle.begin_fill()

    my_turtle.right(90)
    my_turtle.forward(70)
    my_turtle.right(90)
    my_turtle.forward(33)
    my_turtle.right(90)
    my_turtle.forward(70)

    my_turtle.end_fill()

    # the star, see similar example on python.org
    my_turtle.penup()
    my_turtle.setpos(-17, 110)
    my_turtle.color("gold")
    my_turtle.begin_fill()
    my_turtle.pendown()
    for _ in range(36):
        my_turtle.forward(40)
        my_turtle.left(170)
    my_turtle.end_fill()


    # some colourful balls
    def ball(trt, x, y, size=10, colour="red"):
        trt.penup()
        trt.setpos(x, y)
        trt.color(colour)
        trt.begin_fill()
        trt.pendown()
        trt.circle(size)
        trt.end_fill()

    ball(my_turtle, 95, -5)
    ball(my_turtle, -110, -5)
    ball(my_turtle, 80, 40, size=7, colour="gold")
    ball(my_turtle, -98, 40, size=7, colour="gold")
    ball(my_turtle, 70, 70, size=5)
    ball(my_turtle, -93, 70, size=5)


    my_turtle.hideturtle()
    wn.mainloop()

And this is the output:

Turtle XMas Tree
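
A short aside on why the star loop runs 36 times: with equal steps and equal turns, the shape closes once the total rotation is a multiple of 360 degrees. The check below is plain arithmetic, no graphics needed; the helper name is made up for this illustration.

```python
def steps_to_close(turn_angle):
    """Smallest number of equal turns that brings the heading back to the start."""
    steps = 1
    while (steps * turn_angle) % 360 != 0:
        steps += 1
    return steps


star_steps = steps_to_close(170)       # the angle used for the star above
full_turns = star_steps * 170 // 360   # complete revolutions made by the turtle
```

With a turn of 170 degrees, 36 is the smallest number of steps that closes the shape, and the turtle spins around 17 times while drawing it.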

Summary

Turtle graphics is a great educational tool to introduce kids to programming. Grown-ups can use it as well, for a bit of nostalgic fun ;)

The full code for the demo is available on GitHub


Adding Slack Notifications to a Luigi Pipeline in Python

In a previous article, I described how to build a data pipeline in Python using Luigi, a workflow manager written in Python and open sourced by Spotify. I also had the opportunity to give a short talk about Luigi at the local PyData London meetup (see slides).

One of the nice features of Luigi is the possibility of receiving e-mail notifications on error. While this is a useful feature, it’s tailored to errors only, so effectively you don’t know if the Luigi pipeline has completed its execution successfully, unless you manually check. As I wanted to extend the possibility of receiving a notification on Slack, also in case of success, I started looking around for the options.

I ended up developing my own solution: https://github.com/bonzanini/luigi-slack. This blog post is a brief overview on how to use this Python package with your Luigi pipeline.

Getting started with luigi-slack

From your organisation’s Slack page (e.g. yourname.slack.com) you can add a Bot integration. The setup is very quick, and you’ll receive a token that you’ll need to use to interact with the Slack API.

You can get the bleeding edge version of luigi-slack from the GitHub link above, but beware that this is a work in progress. A somewhat stable version is available from the cheese shop:

pip install luigi-slack

The key points of this package are:

  • Support for Python 3
  • Easy-to-use interface

Regarding the first point, the discussion on choosing Python 2 vs Python 3 is still never-ending and I’m not going there in this post. For a greenfield project, I prefer to use Python 3 rather than a version with a sunset date already decided. The support for Python 2 in luigi-slack is best-effort (and of course pull requests are always welcome).

In terms of easy-to-use interface, I borrowed the nice idea of using a context manager from luigi-monitor, because it makes it easy to integrate the library with an existing pipeline.

For example, given the basic code to run a Luigi pipeline which ends with the task YourTaskClass:

import luigi

if __name__ == '__main__':
    luigi.run(main_task_cls=YourTaskClass)

All we need in order to have Slack notifications is to refactor as follows:

import luigi
from luigi_slack import SlackBot, notify

if __name__ == '__main__':
    slacker = SlackBot(token='my-token',
                       channels=['mychannel', 'anotherchannel'])
    with notify(slacker):
        luigi.run(main_task_cls=YourTaskClass)
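
To show why a context manager keeps the refactoring this small, here is a generic sketch of the pattern using contextlib. This is not the luigi-slack implementation: FakeBot, notify_sketch and the message texts are made up for illustration.

```python
from contextlib import contextmanager


class FakeBot:
    """Hypothetical bot that stores messages instead of calling Slack."""

    def __init__(self):
        self.sent = []

    def send(self, message):
        self.sent.append(message)


@contextmanager
def notify_sketch(bot):
    """Run the wrapped block, then report on it once it has finished."""
    try:
        yield bot
    except Exception as exc:
        bot.send("Pipeline failed: {}".format(exc))
        raise  # do not swallow the error
    else:
        bot.send("Pipeline completed")


bot = FakeBot()
with notify_sketch(bot):
    pass  # a real script would start the pipeline here
```

The existing code only gains one level of indentation, while everything that happens before and after the run lives in one place.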

Configuration Options for luigi-slack

The SlackBot takes a number of arguments. Besides the token, which allows you to connect to your organisation’s Slack, all the other parameters are optional:

  • channels (default empty list) is the list of channel names that you want to push the notifications to. For the channel name, you don’t need the initial # symbol. You can also deliver the notifications to a single account, by using the @username syntax
  • events (default to [FAILURE]) is the list of event types, as defined in luigi_slack, that you want to track
  • max_events (default to 5) is the max number of events of a given type that you want to report. With more than max_events events of the same type, a “please check logs” message is reported instead
  • username (default to “Luigi-slack Bot”) is the screen name of your bot
  • task_representation (default to str) is the function used to represent the task in the notification (see explanation below)

In Luigi, representing a task as a string will print the task_id attribute of a luigi.Task, which includes the class name as well as all the parameters. In other words, it looks like:

MyTask(param1="some_value", param2="other_value", your_secret_param="your_secret_value", ...)

With a huge number of parameters that make the notification less readable, or with sensitive parameters that you don’t want to send around in the Slack chat room, it makes sense to display the task in a more conservative way. An example of custom string representation could be:

def custom_task_representation(task):
    return "{}(...)".format(task.__class__.__name__)

Once we pass the function as task_representation argument of the SlackBot, the task will appear in the notifications as:

MyTask(...)

Keep in mind that an instance of a Luigi task is identified by the class name AND the value of its parameters, which is why the task_id includes them all. In other words, with a more compact representation like the one proposed in the above snippet, you won’t be able to distinguish between tasks with the same class name, but different param values. You’ll need to customise the function based on your needs.
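
As a middle ground, one option is to keep the parameters that identify the task but mask the ones that should not be shared. The sketch below assumes that the task exposes its parameters through a param_kwargs dictionary, as luigi.Task instances do. The SECRET_PARAMS set and the DemoTask stand-in are hypothetical, and are only there to make the example self-contained.

```python
SECRET_PARAMS = {"your_secret_param"}  # hypothetical names to hide


def masked_task_representation(task):
    """Show the class name and parameters, hiding the sensitive values."""
    parts = []
    for name, value in sorted(task.param_kwargs.items()):
        shown = "***" if name in SECRET_PARAMS else repr(value)
        parts.append("{}={}".format(name, shown))
    return "{}({})".format(task.__class__.__name__, ", ".join(parts))


class DemoTask:
    """Stand-in for a luigi.Task, used only to try the function out."""

    def __init__(self, **params):
        self.param_kwargs = params


example = masked_task_representation(
    DemoTask(param1="some_value", your_secret_param="your_secret_value"))
```

A function like this could be passed as the task_representation argument in the same way as the compact version above.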

Summary

I’m developing a Python package to add Slack notification support to a Luigi pipeline, with a simple interface, a few optional configuration parameters, and minimal requirements in terms of refactoring.

The code is available at https://github.com/bonzanini/luigi-slack, and you can install the Python package with:

pip install luigi-slack

As this is a work in progress, it’s not widely tested, and the interface could change. Comments and pull requests are welcome.

Building Data Pipelines with Python and Luigi

As a data scientist, the emphasis of the day-to-day job is often more on the R&D side rather than engineering. In the process of going from prototypes to production though, some of the early quick-and-dirty decisions turn out to be sub-optimal and require a decent amount of effort to be re-engineered. This usually slows down innovation, and generally speaking your project as a whole.

This post will discuss some experience in building data pipelines, e.g. extraction, cleaning, integration, pre-processing of data, in general all the steps that are necessary to prepare your data for your data-driven product. In particular, the focus is on data plumbing, and how a workflow manager like Luigi can come to the rescue, without getting in your way. With minimal effort, the transition from prototype to production can be smoother.

You can find the code for the examples as GitHub Gist.

Early Days of a Prototype

In the early days of a prototype, the data pipeline often looks like this:

$ python get_some_data.py
$ python clean_some_data.py
$ python join_other_data.py
$ python do_stuff_with_data.py

This is quite common when the data project is in its exploratory stage: you know that you’ll need some pre-processing, you think it’s going to be a quick hack, so you don’t bother with some engineering best practices, then the number of scripts grows and your data pipeline will come back and bite you.

The only advantage of this approach is that it’s quick and hacky. On the downside, it’s tedious: every time you want to re-run the pipeline, you need to manually call the bunch of scripts in sequence. Moreover, if you’re sharing this prototype with a colleague, there is even more room for misinterpretation (“why can’t I do stuff with data?”… “did you clean it first?”, etc.)

The obvious hacky solution seems to be: let’s put everything in one script. After some quick refactoring, the do_everything.py script can look like this:

if __name__ == '__main__':
    get_some_data()
    clean_some_data()
    join_other_data()
    do_stuff_with_data()

This is fairly simple to run:

$ python do_everything.py

(Note: you could also put everything in a bash script, which calls the individual scripts in sequence, but the shortcomings will be more or less the same)

Boilerplate Code

When moving towards a production-ready pipeline, there are a few more aspects to consider besides the run-everything code. In particular, error handling should be taken into account:

 
try:
    get_some_data()
except GetSomeDataError as e:
    # handle this
    pass

But if we chain all the individual tasks together, we end up with a Christmas tree of try/except:

try:
    get_some_data()
    try:
        clean_some_data()
        try:
            # you see where this is going...
            pass
        except EvenMoreErrors:
            # ...
            pass
    except CleanSomeDataError as e:
        # handle CleanSomeDataError
        pass
except GetSomeDataError as e:
    # handle GetSomeDataError
    pass

Another important aspect to consider is how to resume a pipeline. For example, if the first few tasks are completed, but then an error occurs half-way through, how do we re-run the pipeline without re-executing the initial successful steps?

# check if the task was already successful
if not i_got_the_data_already():
    # if not, run it
    try:
        get_some_data()
    except GetSomeDataError as e:
        # handle the error
        pass
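
Made concrete, the check could be as simple as testing for a file left behind by a successful run. This is only a sketch of the hand-rolled pattern: run_once, the marker file and the get_some_data stand-in are hypothetical.

```python
import os
import tempfile


def run_once(step, marker_path):
    """Run step() unless its marker file exists; return True if it ran."""
    if os.path.exists(marker_path):
        return False  # already done in a previous run
    step()
    # the marker is written only if step() did not raise
    with open(marker_path, "w") as f:
        f.write("done\n")
    return True


calls = []


def get_some_data():
    calls.append("get_some_data")  # stand-in for the real work


workdir = tempfile.mkdtemp()
marker = os.path.join(workdir, "get_some_data.done")
first = run_once(get_some_data, marker)    # runs the step
second = run_once(get_some_data, marker)   # skipped: the marker is there
```

Every step needs its own marker and its own error handling, so this kind of boilerplate grows with the pipeline.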

Moving to Luigi

Luigi is a Python tool for workflow management. It has been developed at Spotify, to help build complex data pipelines of batch jobs. To install Luigi:

$ pip install luigi

Some of the useful features of Luigi include:

  • Dependency management
  • Checkpoints / Failure recovery
  • CLI integration / parameterisation
  • Dependency Graph visualisation

There are two core concepts to understand how we can apply Luigi to our own data pipeline: Tasks and Targets. A task is a unit of work, designed by extending the class luigi.Task and overriding some basic methods. The output of a task is a target, which can be a file on the local filesystem, a file on Amazon’s S3, some piece of data in a database etc.

Dependencies are defined in terms of inputs and outputs, i.e. if TaskB depends on TaskA, it means that the output of TaskA will be the input of TaskB.

Let’s look at a couple of template tasks:

# Filename: run_luigi.py
import luigi

class PrintNumbers(luigi.Task):

    def requires(self):
        return []

    def output(self):
        return luigi.LocalTarget("numbers_up_to_10.txt")

    def run(self):
        with self.output().open('w') as f:
            for i in range(1, 11):
                f.write("{}\n".format(i))

class SquaredNumbers(luigi.Task):

    def requires(self):
        return [PrintNumbers()]

    def output(self):
        return luigi.LocalTarget("squares.txt")

    def run(self):
        with self.input()[0].open() as fin, self.output().open('w') as fout:
            for line in fin:
                n = int(line.strip())
                out = n * n
                fout.write("{}:{}\n".format(n, out))
                
if __name__ == '__main__':
    luigi.run()

This code showcases two tasks: PrintNumbers, which writes the numbers from 1 to 10 into a file called numbers_up_to_10.txt, one number per line, and SquaredNumbers, which reads that file and outputs a list of number-square pairs into squares.txt, also one pair per line.

To run the tasks:

$ python run_luigi.py SquaredNumbers --local-scheduler

Luigi will take care of checking the dependencies between tasks, see that the input of SquaredNumbers is not there, so it will run the PrintNumbers task first, then carry on with the execution.

The first argument we’re passing to Luigi is the name of the last task in the pipeline we want to run. The second argument simply tells Luigi to use a local scheduler (more on this later).

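You could also use the luigi command, passing the module name without the .py extension (the module has to be importable, e.g. with the current directory on your PYTHONPATH):

$ luigi --module run_luigi SquaredNumbers --local-scheduler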

Anatomy of a Task

To create a Luigi task we simply need to create a class whose parent is luigi.Task, and override some methods. In particular:

  • requires() should return the list of dependencies for the given task — in other words a list of tasks
  • output() should return the target for the task (e.g. a LocalTarget, a S3Target, etc.)
  • run() should contain the logic to execute

Luigi will check the return values of requires() and output() and build the dependency graph accordingly.
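
To get an intuition for how these three methods fit together, here is a toy resolver in plain Python. It is not how Luigi's scheduler is implemented: the ToyTask classes and the in-memory DONE set (standing in for targets on disk) are assumptions made for illustration. The idea is simply to skip a task whose output already exists, and otherwise to run its requirements first.

```python
DONE = set()       # stands in for "the output target exists"
RUN_ORDER = []     # records what actually ran


class ToyTask:
    def requires(self):
        return []

    def output(self):
        return self.__class__.__name__  # a name instead of a real target

    def run(self):
        RUN_ORDER.append(self.output())
        DONE.add(self.output())


class ToyPrintNumbers(ToyTask):
    pass


class ToySquaredNumbers(ToyTask):
    def requires(self):
        return [ToyPrintNumbers()]


def build(task):
    """Run the dependencies first, then the task, skipping completed work."""
    if task.output() in DONE:
        return
    for dependency in task.requires():
        build(dependency)
    task.run()


build(ToySquaredNumbers())
build(ToySquaredNumbers())  # second call does nothing: output already there
```

The second call is a no-op, which is the same behaviour that makes re-running a partially completed pipeline cheap.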

Passing Parameters

Hard-coding filenames and config values is generally speaking an anti-pattern. Once you’ve understood the structure and the dynamics of your task, you should look into parameterising all the configuration aspects so that you can dynamically call the same script with different arguments.

The class luigi.Parameter() is the place to look. Each Luigi task can have a number of parameters. Let’s say for example that we want to modify the previous example to support a custom number. As the parameter we’re using with the range() function is an integer, we can use luigi.IntParameter rather than the default parameter class. This is what the modified tasks can look like:

class PrintNumbers(luigi.Task):
    n = luigi.IntParameter()

    def requires(self):
        return []

    def output(self):
        return luigi.LocalTarget("numbers_up_to_{}.txt".format(self.n))

    def run(self):
        with self.output().open('w') as f:
            for i in range(1, self.n+1):
                f.write("{}\n".format(i))

class SquaredNumbers(luigi.Task):
    n = luigi.IntParameter()

    def requires(self):
        return [PrintNumbers(n=self.n)]

    def output(self):
        return luigi.LocalTarget("squares_up_to_{}.txt".format(self.n))

    def run(self):
        with self.input()[0].open() as fin, self.output().open('w') as fout:
            for line in fin:
                n = int(line.strip())
                out = n * n
                fout.write("{}:{}\n".format(n, out))

To call the SquaredNumbers task up to, say, 20:

$ python run_luigi.py SquaredNumbers --local-scheduler --n 20

Parameters can also have default values, e.g.

n = luigi.IntParameter(default=10)

so in this way, if you don’t specify the --n argument, it will default to 10.

Sample code as GitHub Gist

Local vs Global Scheduler

So far, we’ve used the --local-scheduler option to run Luigi tasks with a local scheduler. This is useful for development, but in a production environment we should make use of the centralised scheduler (see the docs on the scheduler).

This has a few advantages:

  • avoid running two instances of the same task simultaneously
  • nice web-based visualisation

You can run the Luigi scheduler daemon in the foreground with:

$ luigid

or in the background with:

$ luigid --background

It will default to port 8082, so you can point your browser to http://localhost:8082 to access the visualisation.

With the global Luigi scheduler running, we can re-run the code without the option for the local scheduler:

$ python run_luigi.py SquaredNumbers --n [BIG_NUMBER]

As the sample code will run in milliseconds, if you want to have a chance to switch to the browser and see the dependency graph while the tasks are still running, you should probably use a big number like 10,000,000 or more for the --n option.

This is a cropped screenshot of the dependency graph:

dependency-graph-screenshot

Summary

We have described the definition of data pipelines using Luigi, a workflow manager written in Python. Luigi provides a nice abstraction to define your data pipeline in terms of tasks and targets, and it will take care of the dependencies for you.

In terms of code re-use, and with the mindset of going from prototype to production, I’ve found it very helpful to define the business logic of the tasks in separate Python packages (i.e. with a setup.py file). In this way, from your Luigi script you can simply import your_package and call it from there.

A task can produce multiple files as output, but if that’s your case, you should probably verify if the task can be broken down into smaller units (i.e. multiple tasks). Do all these outputs logically belong together? Do you have dependencies between them? If you can’t break the task down, I’ve found it simpler/useful just to define the output() as a log file with the names and the timestamps of all the individual files created by the task itself. The log file name can be formatted as TaskName_timestamp_param1value_param2value_etc.
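
A possible helper for that naming scheme is sketched below. The function name, the timestamp format and the .log extension are assumptions made for this example, so adapt them to your own conventions.

```python
from datetime import datetime


def manifest_name(task_name, params, when=None):
    """Build a name like TaskName_timestamp_param1value_param2value.log"""
    when = when or datetime.now()
    stamp = when.strftime("%Y%m%dT%H%M%S")
    # sort the parameter names so the same task always gets the same layout
    values = [str(params[key]) for key in sorted(params)]
    return "_".join([task_name, stamp] + values) + ".log"


name = manifest_name("SquaredNumbers", {"n": 20},
                     when=datetime(2015, 10, 24, 9, 30, 0))
```

Passing the timestamp explicitly, rather than always reading the clock, makes the helper easy to test.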

Using a workflow manager like Luigi is in general helpful because it handles dependencies, it reduces the amount of boilerplate code that is necessary for parameters and error checking, it manages failure recovery and overall it forces us to follow a clear pattern when developing the data pipeline.

It’s also important to consider its limitations:

  • It was built for batch jobs, it’s probably not useful for near real-time processing
  • It doesn’t trigger the execution for you, you still need to run the data pipeline (e.g. via a cronjob)


Easy Text Analytics with the Dandelion API and Python

In the past few weeks, I’ve been playing around with some third-party Web APIs for Text Analytics, mainly for some side projects. This article is a short write-up of my experience with the Dandelion API.

Notice: I’m not affiliated with dandelion.eu and I’m not a paying customer. I’m simply using their basic (i.e. free) plan which is, at the moment, more than enough for my toy examples.

Quick Overview on the Dandelion API

The Dandelion API has a set of endpoints for different text analytics tasks. In particular, they offer semantic analysis features for:

  • Entity Extraction
  • Text Similarity
  • Text Classification
  • Language Detection
  • Sentiment Analysis

As my attention was mainly on entity extraction and sentiment analysis, I’ll focus this article on the two related endpoints.

The basic (free) plan for Dandelion comes with a rate of 1,000 units/day (or approx. 30,000 units/month). Different endpoints have different unit costs: entity extraction and sentiment analysis cost 1 unit per request, while text similarity costs 3 units per request. If you need to pass a URL or HTML instead of plain text, you’ll need to add an extra unit. The API is optimised for short text, so if you’re passing more than 4,000 characters, you’ll be billed extra units accordingly.
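To get a feeling for what the free plan allows, here is a rough back-of-the-envelope sketch based only on the figures quoted above. The names UNIT_COST and requests_per_day are hypothetical, and the extra units for long texts are deliberately left out because the exact billing rule isn’t spelled out here.

```python
# Unit costs as described above (units per request)
UNIT_COST = {
    'entity_extraction': 1,
    'sentiment_analysis': 1,
    'text_similarity': 3,
}
DAILY_UNITS = 1000  # basic (free) plan
URL_OR_HTML_SURCHARGE = 1  # extra unit when passing a URL or HTML


def requests_per_day(endpoint, from_url=False, daily_units=DAILY_UNITS):
    """Estimate how many requests fit in the daily allowance.

    Rough estimate only: the extra units billed for texts longer
    than 4,000 characters are not modelled here.
    """
    cost = UNIT_COST[endpoint]
    if from_url:
        cost += URL_OR_HTML_SURCHARGE
    return daily_units // cost


if __name__ == '__main__':
    for endpoint in UNIT_COST:
        print(endpoint, requests_per_day(endpoint))
```

In other words, with short plain-text inputs the free plan should cover about 1,000 entity extraction or sentiment analysis requests per day, and roughly a third of that for text similarity.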

Getting started

In order to test the Dandelion API, I’ve downloaded some tweets using the Twitter Stream API. You can have a look at a previous article to see how to get data from Twitter with Python.

As NASA recently found evidence of water on Mars, that’s one of the hot topics on social media at the moment, so let’s have a look at a couple of tweets:

  • So what you’re saying is we just found water on Mars…. But we can’t make an iPhone charger that won’t break after three weeks?
  • NASA found water on Mars while Chelsea fans are still struggling to find their team in the league table

(I’m not trying to be funny with Apple/Chelsea fans here: I was trying to collect data to compare iPhone vs Android and some London football teams, but the water-on-Mars topic got all the attention.)

The Dandelion API also provides a Python client, but the use of the API is so simple that we can directly use a library like requests to communicate with the endpoints. If it’s not installed yet, you can simply use pip:

pip install requests

Entity Extraction

Assuming you’ve signed up for the service, you will have an application key and an application ID. You will need them to query the service. The docs also provide all the references for the available parameters, the URI to query and the response format. App ID and key are passed via the parameters $app_id and $app_key respectively (mind the initial $ symbol).

import requests
import json

DANDELION_APP_ID = 'YOUR-APP-ID'
DANDELION_APP_KEY = 'YOUR-APP-KEY'

ENTITY_URL = 'https://api.dandelion.eu/datatxt/nex/v1'

def get_entities(text, confidence=0.1, lang='en'):
    payload = {
        '$app_id': DANDELION_APP_ID,
        '$app_key': DANDELION_APP_KEY,
        'text': text,
        'confidence': confidence,
        'lang': lang,
        'social.hashtag': True,
        'social.mention': True
    }
    response = requests.get(ENTITY_URL, params=payload)
    return response.json()

def print_entities(data):
    for annotation in data['annotations']:
        print("Entity found: %s" % annotation['spot'])

if __name__ == '__main__':
    query = "So what you're saying is we just found water on Mars.... But we can't make an iPhone charger that won't break after three weeks?"
    response = get_entities(query)
    print(json.dumps(response, indent=4))

This will produce the pretty-printed JSON response from the Dandelion API. In particular, let’s have a look at the annotations:

{
    "annotations": [
        {
            "label": "Water on Mars",
            "end": 51,
            "id": 21857752,
            "start": 38,
            "spot": "water on Mars",
            "uri": "http://en.wikipedia.org/wiki/Water_on_Mars",
            "title": "Water on Mars",
            "confidence": 0.8435
        },
        {
            "label": "IPhone",
            "end": 82,
            "id": 8841749,
            "start": 76,
            "spot": "iPhone",
            "uri": "http://en.wikipedia.org/wiki/IPhone",
            "title": "IPhone",
            "confidence": 0.799
        }
    ],
    /* more JSON output here */
}

Interesting to see that “water on Mars” is one of the entities (rather than just “water” and “Mars” as separate entities). Both entities are linked to their Wikipedia page, and both come with a high level of confidence. It would be even more interesting to see a different granularity for entity extraction, as in this case there is an explicit mention of one specific aspect of the iPhone (the battery charger).

The code snippet above also defines a print_entities() function, which you can use in place of the print statement if you want to print out only the entity references. Keep in mind that the attribute spot will contain the text as it appears in the original input. The other attributes of the output are pretty much self-explanatory, but you can check out the docs for further details.

If we run the same code using the Chelsea-related tweet above, we can find the following entities:

{
    "annotations": [
        {
            "uri": "http://en.wikipedia.org/wiki/NASA",
            "title": "NASA",
            "spot": "NASA",
            "id": 18426568,
            "end": 4,
            "confidence": 0.8525,
            "start": 0,
            "label": "NASA"
        },
        {
            "uri": "http://en.wikipedia.org/wiki/Water_on_Mars",
            "title": "Water on Mars",
            "spot": "water on Mars",
            "id": 21857752,
            "end": 24,
            "confidence": 0.8844,
            "start": 11,
            "label": "Water on Mars"
        },
        {
            "uri": "http://en.wikipedia.org/wiki/Chelsea_F.C.",
            "title": "Chelsea F.C.",
            "spot": "Chelsea",
            "id": 7473,
            "end": 38,
            "confidence": 0.8007,
            "start": 31,
            "label": "Chelsea"
        }
    ],
    /* more JSON output here */
}

Overall, it looks quite interesting.
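If you only care about the most reliable matches, one option is to filter the annotations by confidence on the client side. The following is a small sketch that works on a trimmed copy of the response above, so it doesn’t need network access. The function name entity_spots and the 0.85 threshold are arbitrary choices of mine, not something suggested by the API docs.

```python
def entity_spots(data, min_confidence=0.0):
    """Return (spot, title, confidence) for annotations at or above a threshold."""
    return [
        (a['spot'], a['title'], a['confidence'])
        for a in data.get('annotations', [])
        if a['confidence'] >= min_confidence
    ]


# Trimmed copy of the response for the Chelsea-related tweet
sample = {
    'annotations': [
        {'spot': 'NASA', 'title': 'NASA', 'confidence': 0.8525},
        {'spot': 'water on Mars', 'title': 'Water on Mars', 'confidence': 0.8844},
        {'spot': 'Chelsea', 'title': 'Chelsea F.C.', 'confidence': 0.8007},
    ]
}

if __name__ == '__main__':
    for spot, title, confidence in entity_spots(sample, min_confidence=0.85):
        print("%s -> %s (%.4f)" % (spot, title, confidence))
```

With this threshold, only NASA and “water on Mars” are kept. Without a threshold, the Chelsea annotation also shows the difference between spot (the text in the tweet, “Chelsea”) and title (the linked entity, “Chelsea F.C.”).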

Sentiment Analysis

Sentiment Analysis is not an easy task, especially when performed on tweets (very little context, informal language, sarcasm, etc.).

Let’s try to use the Sentiment Analysis API with the same tweets:

import requests
import json

DANDELION_APP_ID = 'YOUR-APP-ID'
DANDELION_APP_KEY = 'YOUR-APP-KEY'

SENTIMENT_URL = 'https://api.dandelion.eu/datatxt/sent/v1'

def get_sentiment(text, lang='en'):
    payload = {
        '$app_id': DANDELION_APP_ID,
        '$app_key': DANDELION_APP_KEY,
        'text': text,
        'lang': lang
    }
    response = requests.get(SENTIMENT_URL, params=payload)
    return response.json()

if __name__ == '__main__':
    query = "So what you're saying is we just found water on Mars.... But we can't make an iPhone charger that won't break after three weeks?"
    response = get_sentiment(query)
    print(json.dumps(response, indent=4))

This will print the following output:

{
    "sentiment": {
        "score": -0.7,
        "type": "negative"
    },
    /* more JSON output here */
}

The “sentiment” attribute will give us a score (from -1, totally negative, to 1, totally positive), and a type, which is one of positive, negative or neutral.

The main limitation here is that the object of the sentiment is not explicitly identified. Even if we cross-reference the entities extracted in the previous paragraph, how can we programmatically link the negative sentiment with one of them? Is the negative sentiment related to finding water on Mars, or to the iPhone? As mentioned in the previous paragraph, there is also an explicit mention of the battery charger, which is not captured by the APIs and which is the target of the sentiment for this example.
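To make the limitation concrete, here is a deliberately naive sketch that cross-references the two responses by attaching the document-level score to every extracted entity. The function name naive_entity_sentiment is hypothetical, and the input dictionaries are trimmed copies of the outputs shown earlier.

```python
def naive_entity_sentiment(entities, sentiment):
    """Attach the single document-level score to every entity spot.

    Naive on purpose: the API returns one score for the whole text,
    so every entity ends up with the same value.
    """
    score = sentiment['sentiment']['score']
    return {a['spot']: score for a in entities['annotations']}


# Trimmed copies of the two responses for the iPhone tweet
entities = {'annotations': [{'spot': 'water on Mars'}, {'spot': 'iPhone'}]}
sentiment = {'sentiment': {'score': -0.7, 'type': 'negative'}}

if __name__ == '__main__':
    print(naive_entity_sentiment(entities, sentiment))
```

Both “water on Mars” and “iPhone” get a score of -0.7, which clearly doesn’t reflect what the tweet is saying about each of them.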

The Chelsea tweet above will also produce a negative score. After downloading some more data looking for some positive tweets, I found this:

Nothing feels better than finishing a client job that you’re super happy with. Today is a good day.

The output for the Sentiment Analysis API:

{
    "sentiment": {
        "score": 0.7333333333333334,
        "type": "positive"
    },
    /* more JSON output here */
}

Well, this one was probably very explicit.

Summary

Using a third-party API can be as easy as writing a couple of lines in Python, or it can be a major pain. I think the short examples here showcase that the “easy” in the title is well motivated.

It’s worth noting that this article is not a proper review of the Dandelion API. It’s more like a short diary entry of my experiments, so what I’m reporting here is not a rigorous evaluation.

Anyway, the feeling is quite positive for the Entity Extraction API. I also did some tests using hashtags with some acronyms, and the API was able to correctly point me to the related entity. Occasionally, some pieces of text that are completely out of scope get labelled as entities. This happens mostly with some movie (or song, or album) titles appearing verbatim in the text, and they are probably labelled because of the little context you have in Twitter’s 140 characters.

On the Sentiment Analysis side, I think providing only one aggregated score for the whole text sometimes doesn’t give the full picture. While it makes sense in some sentiment classification tasks (e.g. movie reviews, product reviews, etc.), we have seen more and more work on aspect-based sentiment analysis, which is what provides the right level of granularity to understand more deeply what the users are saying. As I’ve already mentioned, this is not trivial anyway.

Overall, I had some fun playing with this API and I think the authors did a good job in keeping it simple to use.