Building a search-as-you-type feature with Elasticsearch, AngularJS and Flask (Part 2: front-end)

This article is the second part of a tutorial which describes how to build a search-as-you-type feature based on Elasticsearch, Python/Flask and AngularJS.

The first part has discussed how to set-up Elasticsearch and a microservice in Python/Flask, i.e. the back-end part of the system. It also provided an overall view on the architecture. In this second part, we’ll discuss details about the front-end, based on AngularJS.

The full code is available at https://github.com/bonzanini/CheerMeApp-demo.

Single-Page App

The front-end is a single-page application which uses AngularJS, as well as Bootstrap for styling.

Firstly, we create an index.html page, declaring the HTML document as an AngularJS app with the ng-app attribute:

<html ng-app="myApp">

In the head declarations, we’ll need to include AngularJS itself as well as some of its components (we’re using angular-route and angular-resource), the Bootstrap stylesheet and the custom app code, e.g.

<head>
    <!-- Load AngularJS -->
    <script src="https://code.angularjs.org/1.4.3/angular.min.js"></script>
    <script src="https://code.angularjs.org/1.4.3/angular-route.min.js"></script>
    <script src="https://code.angularjs.org/1.4.3/angular-resource.min.js"></script>
    <!-- Load Bootstrap CSS-->
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
    <!-- Load custom app code -->
    <script type="text/javascript" src="app.js"></script>
</head>

The whole user interface resides in a <div> container as ng-view:

<body>
    <!-- The UI -->
    <div class="container">
        <div ng-view></div>
    </div>
</body>

Both ng-app and ng-view are directives defined by AngularJS, i.e. Angular extends HTML using attributes with a “ng-” prefix.

The core of the front-end

The main front-end component is the AngularJS app defined in app.js. The starting point is the module definition:

var myApp = angular.module("myApp", ["ngRoute", "ngResource", "myApp.services"]);

The myApp application has some dependencies, namely ngRoute for routing (e.g. for templates), ngResource to access external RESTful resources, and the custom myApp.services module, which defines how to access such resources.

If you remember from the previous article, we have a microservice based on Python/Flask listening on localhost:5000, which provides access to a REST API. The myApp.services module is what binds this API to our AngularJS app, defining the way we access each resource, e.g.

// services definition
var services = angular.module("myApp.services", ["ngResource"]);

// create specific resources, defining the related URLs and how to access them
services
.factory('Beer', function($resource) {
    return $resource('http://localhost:5000/api/v1/beers/:id', {id: '@id'}, {
        get: { method: 'GET' },
        delete: { method: 'DELETE' }
    });
})
.factory('Beers', function($resource) {
    return $resource('http://localhost:5000/api/v1/beers', {}, {
        query: { method: 'GET', isArray: true },
        create: { method: 'POST' }
    });
})
.factory('Search', function($resource) {
    return $resource('http://localhost:5000/api/v1/search', {q: '@q'}, {
        query: { method: 'GET', isArray: true}
    });
});

Once the resources are defined, we can define the rules for routing/templating, e.g.

myApp.config(function($routeProvider) {
    $routeProvider
    .when('/', {
        templateUrl: 'pages/main.html',
        controller: 'mainController'
    })
    .when('/newBeer', {
        templateUrl: 'pages/beer_new.html',
        controller: 'newBeerController'
    })
    .when('/beers', {
        templateUrl: 'pages/beers.html',
        controller: 'beerListController'
    })
    .when('/beers/:id', {
        templateUrl: 'pages/beer_details.html',
        controller: 'beerDetailsController'
    })
});

The $routeProvider simply allows us to associate a matching URL with a template (an HTML page) and a controller (a function that, among other aspects, binds data with the template).

For example the controller of the entry page can be defined as:

myApp.controller(
    'mainController',
    function ($scope, Search) {
        $scope.search = function() {
            var q = $scope.searchString;
            if (q.length > 1) {
                $scope.results = Search.query({q: q});    
            }
        };
    }
);

In the controller, there are three references to the scope, namely searchString, results and search. The first one is the content of the input field used for search, i.e.

<input type="text" class="form-control" ng-model="searchString" placeholder='Search: e.g. "light beer" or "London"' ng-change="search()"/>

while the second one is the list of results, in the form of table rows, i.e.

<tr ng-repeat="result in results">
    <td><a href="#/beers/{{result.id}}">{{ result.name }}</a></td>
    <td>{{ result.producer }}</td>
</tr>

The third reference is a function, search(), defined in the controller itself and invoked by the UI whenever the text in the input field changes. The function checks whether the text has at least two characters, and then sends it as a query to the Search resource declared at the beginning in the services variable (i.e. as part of the REST API). If the search provides results (a list of beers along with their producers), these are shown as table rows.

The two HTML snippets above are part of the pages/main.html template mentioned in the routing configuration, and linked to mainController.

Other controllers are defined in a similar fashion, and they all define the behavior of a specific view, just with a few lines of Javascript.

Summary

Using AngularJS and Bootstrap, we have quickly created a simple and clean UI for our search-as-you-type system. As the access to the data happens through the microservice defined in the previous article, we have defined the access to the REST API using ngResource.

Each view in the UI is defined as a template, i.e. an HTML page. The behaviour of the UI and the data binding are defined in the controllers.

All in all, with a relatively small amount of Javascript code, AngularJS allows us to build an interactive UI that can access REST resources.


@MarcoBonzanini

Building a Search-As-You-Type Feature with Elasticsearch, AngularJS and Flask

Search-as-you-type is an interesting feature of modern search engines, which gives users instant feedback on their search while they are still typing a query.

In this tutorial, we discuss how to implement this feature in a custom search engine built with Elasticsearch and Python/Flask on the backend side, and AngularJS for the frontend.

The full code is available at https://github.com/bonzanini/CheerMeApp-demo. If you go through the code, have a look at the readme file first, in particular to understand the limitations of the code.

This first part describes the details of the backend, i.e. Elasticsearch and Python/Flask.

Update: the second part of this tutorial has been published and it discusses the front-end in AngularJS.

Overall Architecture

As this demo was prototyped during International Beer Day 2015, we’ll build a small database of beers, each of which will be defined by a name, the name of its producer, a list of beer styles and a textual description. The idea is to make all these data available for search in one coherent interface, so you can just type in the name of your favourite brew, or aspects like “light and fruity”.

Our system is made up of three components:

  • Elasticsearch: used as data storage and for its search capabilities.
  • Python/Flask Microservice: the backend component that has access to Elasticsearch and provides a RESTful API for the frontend.
  • AngularJS UI: the frontend that requests data from the backend microservice.

There are two types of documents – beers and styles. While styles are simple strings with the style name, beers are more complex. This is an example:

{
    "name": "Raspberry Wheat Beer", 
    "styles": ["Wheat Ale", "Fruit Beer"], 
    "abv": 5.0, 
    "producer": "Meantime Brewing London", 
    "description": "Based on a pale, lightly hopped wheat beer, the refreshingly crisp fruitiness, aroma and rich colour come from the addition of fresh raspberry puree during maturation."
}

(the description is taken from the producer’s website in August 2015).

Setting Up Elasticsearch

The mapping for the Elasticsearch types is fairly straightforward. The key detail in order to enable the search-as-you-type feature is how to perform partial matching over strings.

One option is to use wildcard queries over not_analyzed fields, similar to a ... WHERE field LIKE '%foobar%' query in SQL, but this is usually too expensive. Another option is to change the analysis chain in order to also index partial strings: this will result in a bigger index but in faster queries.

We can achieve our goal by using the edge_ngram filter as part of a custom analyser, e.g.:

{
    "settings": {
        "number_of_shards" : 1,
        "number_of_replicas": 0,
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type":     "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}

In this example, the custom filter will allow us to index substrings of 2-to-15 characters. You can customise these boundaries, but indexing unigrams (min_gram: 1) will probably cause any query to match any document, and words longer than 15 characters are rarely observed (e.g. we're not dealing with long compounds).

Once the custom analysis chain is defined, the mapping is easy:

{    
    "mappings": {
        "beers": {
            "properties": {
                "name": {"type": "string", "index_analyzer": "autocomplete", "search_analyzer": "standard"},
                "styles": {"type": "string", "index_analyzer": "autocomplete", "search_analyzer": "standard"},
                "abv": {"type": "float"},
                "producer": {"type": "string", "index_analyzer": "autocomplete", "search_analyzer": "standard"},
                "description": {"type": "string", "index_analyzer": "autocomplete", "search_analyzer": "standard"}
            }
        },
        "styles": {
            "properties": {
                "name": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
}
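
Once the index has been created with these settings and mappings (the next section creates an index called cheermeapp), you can check what the autocomplete analyser actually produces with the _analyze API. A minimal sketch in Python with the requests library; the query-string form of the call below is the one used by the 1.x series, so adapt it to your Elasticsearch version:

import requests

resp = requests.get('http://localhost:9200/cheermeapp/_analyze',
                    params={'analyzer': 'autocomplete', 'text': 'Raspberry Wheat'})
for token in resp.json()['tokens']:
    print(token['token'])
# ra, ras, rasp, ... raspberry, wh, whe, ... wheat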

Populating Elasticsearch

Assuming you have Elasticsearch up-and-running locally on localhost:9200 (the default), you can simply type make index from the demo folder.

This will firstly try to delete an index called cheermeapp (you’ll see a missing index error the first time, as there is of course no index yet). Secondly, the index is recreated by pushing the mapping file to Elasticsearch, and finally some data are indexed using the _bulk API.
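
For reference, the same three steps can be reproduced in a few lines of Python with the requests library. This is only a sketch of what the Makefile does with curl; the mapping file name and the sample document below are assumptions, not the repo's actual files:

import json
import requests

ES_INDEX = 'http://localhost:9200/cheermeapp'

# 1) delete the index, if it exists (the first run will report a missing index)
requests.delete(ES_INDEX)

# 2) re-create it, pushing the settings and mappings
with open('mapping.json') as f:
    requests.put(ES_INDEX, data=f.read())

# 3) index some documents via the _bulk API (action line + document line, newline-separated)
beers = [{"name": "Raspberry Wheat Beer", "producer": "Meantime Brewing London"}]
bulk_body = ''
for i, beer in enumerate(beers):
    bulk_body += json.dumps({"index": {"_index": "cheermeapp", "_type": "beers", "_id": i}}) + '\n'
    bulk_body += json.dumps(beer) + '\n'
requests.post('http://localhost:9200/_bulk', data=bulk_body)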

If you want to see some data, you can now type:

curl -XPOST http://localhost:9200/cheermeapp/beers/_search?pretty -d '{"query": {"match_all": {}}}'

A Python Microservice with Flask

As the Elasticsearch service is by default open to any connection, it is common practice to put it behind a custom web service. Luckily, Flask and its Flask-RESTful extension allow us to quickly set up a RESTful microservice which exposes some useful endpoints. These endpoints will then be queried by the frontend.

If you’re following the code from the repo, the recommendation is to set-up a local virtualenv as described in the readme, in order to install the dependencies locally. You can see the full code for the backend microservice is the backend folder.

In particular, in backend/__init__.py we declare the Flask application as:

from flask import Flask
from flask_restful import reqparse, Resource, Api
from flask.ext.cors import CORS
from . import config
import requests
import json

app = Flask(__name__)
CORS(app) # required for Cross-origin Request Sharing
api = Api(app)
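
The config module imported above is not listed in this article; judging from how it is used below (es_base_url and api_base_url), a minimal sketch could look like the following (the version in the repo may differ):

# backend/config.py -- a minimal sketch reconstructed from its usage in this article
es_server = 'http://localhost:9200/cheermeapp'

# base URLs for the Elasticsearch types we expose
es_base_url = {
    'beers': es_server + '/beers',
    'styles': es_server + '/styles',
}

# all the API URLs start with /api/v1
api_base_url = '/api/v1'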

By setting up the backend app as a Python package (a folder with an __init__.py file), the script to run this app is extremely simple:

# runbackend.py
from backend import app

if __name__ == '__main__':
    app.run(debug=True)

This code just sets up an empty web service: we need to implement the endpoints and the related resources. One nice aspect of Flask-RESTful is that it allows us to define the resources as Python classes, adding the endpoints with minimal effort.

For example, in backend/__init__.py we can continue defining the following:

class Beer(Resource):

    def get(self, beer_id):
        # the base URL for a "beers" object in Elasticsearch, e.g.
        # http://localhost:9200/cheermeapp/beers/<beer_id>
        url = config.es_base_url['beers']+'/'+beer_id
        # query Elasticsearch
        resp = requests.get(url)
        data = resp.json()
        # Return the full Elasticsearch object as a result
        beer = data['_source']
        return beer

    def delete(self, beer_id):
        # same as above
        url = config.es_base_url['beers']+'/'+beer_id
        # Query Elasticsearch
        resp = requests.delete(url)
        # return the response
        data = resp.json()
        return data
# The API URLs all start with /api/v1, in case we need to implement different versions later
api.add_resource(Beer, config.api_base_url+'/beers/<beer_id>')

class BeerList(Resource):

    def get(self):
        # same as above
        url = config.es_base_url['beers']+'/_search'
        # we retrieve all the beers (well, at least the first 100)
        # Limitation: pagination to be implemented
        query = {
            "query": {
                "match_all": {}
            },
            "size": 100
        }
        # query Elasticsearch
        resp = requests.post(url, data=json.dumps(query))
        data = resp.json()
        # build an array of results and return it
        beers = []
        for hit in data['hits']['hits']:
            beer = hit['_source']
            beer['id'] = hit['_id']
            beers.append(beer)
        return beers
api.add_resource(BeerList, config.api_base_url+'/beers')

The above code implements the GET and DELETE methods for /api/v1/beers/<beer_id>, which respectively retrieve and delete a specific beer, and the GET method for /api/v1/beers, which retrieves the full list of beers. In the repo, you can also observe the POST method implemented on the BeerList class, which allows us to create a new beer.

Design note: given that create-read-update operations, as well as the search, will work on the same data model, it's probably more sensible to de-couple the object model from the endpoint definition, e.g. by defining a BeerModel and calling it from the related resources.
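
As an illustration only (this is not how the demo repo is organised), such a BeerModel could look like:

class BeerModel(object):
    """Shared access to the Elasticsearch 'beers' type (illustrative sketch)."""

    @staticmethod
    def from_hit(hit):
        # turn an Elasticsearch hit into the dictionary returned by the API
        beer = hit['_source']
        beer['id'] = hit['_id']
        return beer

    @staticmethod
    def search(query):
        url = config.es_base_url['beers'] + '/_search'
        resp = requests.post(url, data=json.dumps(query))
        data = resp.json()
        return [BeerModel.from_hit(hit) for hit in data['hits']['hits']]

Both BeerList.get() and the Search resource shown below would then only build their query and delegate to BeerModel.search().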

From the repo, you can also see the implementation of the /api/v1/styles endpoint.

Once the backend is running, the service will be accessible at localhost:5000 (the default option for Flask). You can test it with:

curl -XGET http://localhost:5000/api/v1/beers

The Search Functionality

Besides serving “items”, our microservice also incorporates a search functionality:

class Search(Resource):

    def get(self):
        # parse the query string: ?q=[something]
        parser = reqparse.RequestParser()
        parser.add_argument('q')
        query_string = parser.parse_args()
        # base search URL
        url = config.es_base_url['beers']+'/_search'
        # Query Elasticsearch
        query = {
            "query": {
                "multi_match": {
                    "fields": ["name", "producer", "description", "styles"],
                    "query": query_string['q'],
                    "type": "cross_fields",
                    "use_dis_max": False
                }
            },
            "size": 100
        }
        resp = requests.post(url, data=json.dumps(query))
        data = resp.json()
        # Build an array of results
        beers = []
        for hit in data['hits']['hits']:
            beer = hit['_source']
            beer['id'] = hit['_id']
            beers.append(beer)
        return beers
api.add_resource(Search, config.api_base_url+'/search')

The above code will make a /api/v1/search endpoint available for custom queries.

The interface with Elasticsearch is a custom multi_match and cross_fields query, which searches over the name, producer, styles and description fields, i.e. all the textual fields.

By default, Elasticsearch performs multi_match queries as best_fields, which means only the field with the best score determines the overall score for a particular document. In our case, we prefer to have all the fields contribute to the final score. In particular, we want to avoid longer fields like the description being penalised by the document length normalisation.

Design note: notice how we're duplicating the same code at the end of Search.get() and BeerList.get(); we should really decouple this.

You can test the search service with:

curl -XGET http://localhost:5000/api/v1/search?q=lon
# will retrieve all the beers matching "lon", e.g. containing the string "london"

The next step is to create the frontend to query the microservice and show the results in a nice UI. The implementation is already available in the repo, and will be discussed in the next article.

Summary

This article sets up the backend side of a search-as-you-type application.

The scenario is the CheerMeApp application, a mini database of beers with names, styles and descriptions. The search application can match any of these fields while the user is still typing, i.e. with partial string matching.

The backend side of the app is based on Elasticsearch for the data storage and search functionality. In particular, by indexing substrings (n-grams) we allow for partial string matching, at the cost of a bigger index on disk, without hurting query-time performance.

The data storage is “hidden” behind a Python/Flask microservice, which provides endpoints for a client to query. In particular, we have seen how the Flask-RESTful extension allows us to quickly create RESTful applications by simply declaring the resources as Python classes.

The next article will discuss some aspects of the frontend, developed in AngularJS, and how to link it with the backend.

Getting Started with Apache Spark and Python 3

Apache Spark is a cluster computing framework, currently one of the most actively developed in the open-source Big Data arena. It aims at being a general engine for large-scale data processing, supporting a number of platforms for cluster management (e.g. YARN or Mesos as well as Spark native) and a variety of distributed storage systems (e.g. HDFS or Amazon S3).

More interestingly, at least from a developer’s perspective, it supports a number of programming languages. Since the latest version 1.4 (June 2015), Spark supports R and Python 3 (to complement the previously available support for Java, Scala and Python 2).

This article is a brief introduction on how to use Spark on Python 3.

Quick Start

After downloading a binary version of Spark 1.4, we can extract it in a custom folder, e.g. ~/apps/spark, which we’ll call $SPARK_HOME:

export SPARK_HOME=~/apps/spark

This folder contains several Spark commands (in $SPARK_HOME/bin) as well as examples of code (in $SPARK_HOME/examples/src/main/YOUR-LANGUAGE).

We can run Spark with Python in two ways: using the interactive shell, or submitting a standalone application.

Let’s start with the interactive shell, by running this command:

$SPARK_HOME/bin/pyspark

You will get several messages on the screen while the shell is loading, and at the end you should see the Spark banner:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/

Using Python version 2.7.5 (default, Mar  9 2014 22:15:05)
SparkContext available as sc, HiveContext available as sqlContext.
>>>

The >>> prompt is the usual Python prompt, as effectively we are using a Python interactive shell.

SparkContext and HiveContext are Spark concepts that we’ll briefly explain below. The interactive shell is telling us that these two contexts have been initialised and are available as sc and sqlContext in this session. The shell is also telling us that we’re using Python 2.7!?

But I want to use Python 3!

Long story short, Python 2 is still the default option in Spark, which you can see if you open the pyspark script with an editor (it’s a shell script). You can simply override this behaviour by setting an environment variable:

export PYSPARK_PYTHON=python3

Once you re-run the interactive shell, the Spark banner should be updated to reflect your version of Python 3.

Some Basic Spark Concepts

Two Spark concepts have already been mentioned above:

  • SparkContext: it’s an object that represents a connection to a computing cluster – an application will access Spark through this context
  • HiveContext: it’s an instance of the Spark SQL engine, that integrates data stored in Hive (not used in this article)

Another core concept in Spark is the Resilient Distributed Dataset (RDD), an immutable distributed collection of objects. Each RDD is split into partitions, which might be processed on different nodes of a cluster.

RDDs can be loaded from external sources, e.g. from text files, or can be transformed into new RDDs.

There are two types of operation that can be performed over an RDD (a short example follows below):

  • a transformation will leave the original RDD intact and create a new one (RDDs are immutable); an example of a transformation is the use of a filter
  • an action will compute a result based on the RDD, e.g. counting the number of lines in an RDD
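
A minimal sketch, runnable in the pyspark interactive shell (where sc is already defined), also shows Spark's lazy evaluation: transformations only describe the computation, while actions trigger it.

nums = sc.parallelize([1, 2, 3, 4, 5])     # create an RDD from a Python list
evens = nums.filter(lambda n: n % 2 == 0)  # transformation: nothing is computed yet
print(evens.count())                       # action: the filter is now executed -> prints 2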

Running an Example Application

Running the interactive shell can be useful for interactive analysis, but sometimes you need to launch a batch job, i.e. you need to submit a stand-alone application.

Consider the following code and save it as line_count.py:

from pyspark import SparkContext

import sys

if __name__ == '__main__':

    fname = sys.argv[1]
    search1 = sys.argv[2].lower()
    search2 = sys.argv[3].lower()

    sc = SparkContext("local", appName="Line Count")
    data = sc.textFile(fname)

    # Transformations
    filtered_data1 = data.filter(lambda s: search1 in s.lower())
    filtered_data2 = data.filter(lambda s: search2 in s.lower())
    # Actions
    num1 = filtered_data1.count()
    num2 = filtered_data2.count()

    print('Lines with "%s": %i, lines with "%s": %i' % (search1, num1, search2, num2))

The application will take three parameters from the command line (via sys.argv), namely a file name and two search terms.

The sc variable contains the SparkContext, initialised as local context (we’re not using a cluster).

The data variable is the RDD, loaded from an external resource (the aforementioned file).

What follows the data import is a series of transformations and actions. The basic idea is to simply count how many lines contain the given search terms.

For this example, I’m using a data-set of tweets downloaded for a previous article, stored in data.json one tweet per line.

I can now launch (submit) the application with:

$SPARK_HOME/bin/spark-submit line_count.py data.json \#ita \#eng

Within a lot of debugging information, the application will print out the final count:

Lines with "#ita": 1339, lines with "#eng": 2278

Notice the use of the backslash from the command-line, because we need to escape the # symbol: effectively the search terms are #ita and #eng.

Also notice that we don’t have information about repeated occurrences of the search terms, nor about partial matches (e.g. “#eng” will also match “#england”, etc.): this example just showcases the use of transformations and actions.
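
If exact matches were needed, one option would be to compare whole tokens rather than substrings, e.g. by splitting each line on whitespace. A rough sketch along those lines (just an illustration; punctuation and the JSON structure of the tweets are ignored):

# Count lines where the search term appears as a whole (whitespace-separated) token
exact_match1 = data.filter(lambda s: search1 in s.lower().split())
num_exact1 = exact_match1.count()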

Summary

Spark now supports Python 3 :)

@MarcoBonzanini

How to Develop and Distribute Python Packages

This article contains some notes about the development of Python modules and packages, as well as brief overview on how to distribute a package in order to make it easy to install via pip.

Modules vs Packages in Python

Firstly, let’s start from the distinction between modules and packages, which is something sligthly different from language to language.

In Python, a simple source file containing the definitions of functions, classes and variables is a module. Once your application grows, you can organise your code into different files (modules) so that you can keep your sources tidy and clean, and you can re-use some of the functionalities in other applications.

On the other hand, a package is a folder containing an __init__.py file, as well as other Python source files. Typically a package contains several modules and sub-packages.

For example, you could have a foobar.py file where you declare a hello() function. You can re-use the function in different ways:

# import whole module and use its namespace
import foobar
foobar.hello()
# import specific function in local namespace
from foobar import hello
hello()
# import specific function in local namespace, create an alias
from foobar import hello as hi
hi()
# import all module declarations in local namespace
from foobar import *
hello()

The last option is usually considered sub-optimal, because you’re going to pollute the local namespace causing potential name conflicts. For example, assuming you imported some maths libraries and you’re using the log() function, is it coming from math.log() or numpy.log()? I usually aim for clarity when I choose which option is more suitable for a particular case.

Similarly, you can import a package, a particular definition, a sub-package, etc.

Notice: the import command will look for modules and packages in the working directory as well as folders declared in the Python path. You can find out where your libraries are stored by looking at:

import sys
print(sys.path)

The Python path can be extended with user-specific folders by setting the $PYTHONPATH environment variable.

This means that if you want to make a particular module/package available to an application, it must either be in the working directory or in one of the folders dedicated to Python libraries. The latter option is usually achieved via the creation of an installation script.

Setup Tools and setup.py

As part of the Python Standard Library, the main component to develop installation scripts is distutils. However, to overcome its limitations, setuptools is now the recommended option.

By creating a setup.py script in the parent folder of your package, you can make it easy to install if you share it via Github or if you make it available for pip.

The basic structure of setup.py looks like:

from setuptools import setup

long_description = 'Looong description of your package, e.g. a README file'

setup(name='yourpackage', # name your package
      packages=['yourpackage'], # same name as above
      version='1.0.0', 
      description='Short description of your package',
      long_description=long_description,
      url='http://example.org/yourpackage',
      author='Your Name',
      author_email='your.name@example.org',
      license='MIT') # choose the appropriate license
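
In practice, rather than hard-coding the long description, you would typically read it from the README file that ships with the package, e.g.:

# read the long description from the README
with open('README.rst') as f:
    long_description = f.read()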

The source code of the package should be put into a folder named after the package itself, while the setup script should be in the parent directory together with the documentation. This is an example of source structure:

.
├── LICENSE
├── README.rst
├── setup.py
└── yourpackage
    ├── __init__.py
    ├── some_module.py
    ├── other_module.py
    └── sub_package
        ├── __init__.py
        └── more_modules.py

The LICENSE and README.rst files are documentation, the setup.py file is the installation script as above, while the whole source code of the package with its components is under the yourpackage folder.

You could install the package and make it available for any of your Python apps with:

python setup.py install

If you publish the above structure on a public repository, e.g. on GitHub, anyone could easily install it with:

git clone https://www.github.com/yourname/yourpackage
cd yourpackage
python setup.py install

PyPI as Public Repo

PyPI, the Python Package Index, also known as the CheeseShop, is where developers can publish their Python packages to make them available for easy installation via pip.

Once your package is ready to be published, you’ll need to register your account on PyPI. You should also register your new package on PyPI: you can do so using the web form on the PyPI website.

Once your account is ready, create a file called .pypirc in your home folder:

$ cat ~/.pypirc 

[distutils]
index-servers=pypi

[pypi]
repository = https://pypi.python.org/pypi
username = your-username
password = your-password

Now you're ready to push your package to the public index:

python setup.py sdist upload

The sdist command will create the package to distribute, while the upload command will push it to the public repository using the information that you stored in ~/.pypirc.

At this point, you can install your brand new Python package on any machine by typing:

pip install yourpackage

Conclusion

Organising your code into modules and packages will help keep your codebase clean. In particular, packing your code into meaningful packages will improve code re-use. There are only a few simple steps to follow in order to create a Python package that can be easily distributed, and if you decide to do so, the Python Package Index is the obvious choice.

Mining Twitter Data with Python (and JS) – Part 7: Geolocation and Interactive Maps

Geolocation is the process of identifying the geographic location of an object such as a mobile phone or a computer. Twitter allows its users to provide their location when they publish a tweet, in the form of latitude and longitude coordinates. With this information, we are ready to create some nice visualisation for our data, in the form of interactive maps.

This article briefly introduces the GeoJSON format and Leaflet.js, a nice Javascript library for interactive maps, and discusses its integration with the Twitter data we have collected in the previous parts of this tutorial (see Part 4 for details on the rugby data set).


GeoJSON

GeoJSON is a format for encoding geographic data structures. The format supports a variety of geometric types that can be used to visualise the desired shapes onto a map. For our examples, we just need the simplest structure, a Point. A point is identified by its coordinates (latitude and longitude).

In GeoJSON, we can also represent objects such as a Feature or a FeatureCollection. The first one is basically a geometry with additional properties, while the second one is a list of features.

Our Twitter data set can be represented in GeoJSON as a FeatureCollection, where each tweet would be an individual Feature with its own geometry (the aforementioned Point).

This is what the JSON structure looks like:

{
    "type": "FeatureCollection",
    "features": [
        { 
            "type": "Feature",
            "geometry": {
                "type": "Point", 
                "coordinates": [some_latitude, some_longitude]
            },
            "properties": {
                "text": "This is sample a tweet",
                "created_at": "Sat Mar 21 12:30:00 +0000 2015"
            }
        },
        /* more tweets ... */
    ]
}

From Tweets to GeoJSON

Assuming the data are stored in a single file as described in the first chapter of this tutorial, we simply need to iterate all the tweets looking for the coordinates field, which may or may not be present. Keep in mind that you need to use coordinates, because the geo field is deprecated (see the API).

This code will read the data set, looking for tweets where the coordinates are explicitly given. Once the GeoJSON data structure is created (in the form of a Python dictionary), the data are dumped into a file called geo_data.json:

# Tweets are stored in "fname", one tweet per line
import json

with open(fname, 'r') as f:
    geo_data = {
        "type": "FeatureCollection",
        "features": []
    }
    for line in f:
        tweet = json.loads(line)
        if tweet['coordinates']:
            geo_json_feature = {
                "type": "Feature",
                "geometry": tweet['coordinates'],
                "properties": {
                    "text": tweet['text'],
                    "created_at": tweet['created_at']
                }
            }
            geo_data['features'].append(geo_json_feature)

# Save geo data
with open('geo_data.json', 'w') as fout:
    fout.write(json.dumps(geo_data, indent=4))

Interactive Maps with Leaflet.js

Leaflet.js is an open-source Javascript library for interactive maps. You can create maps with tiles of your choice (e.g. from OpenStreetMap or MapBox), and overlap interactive components.

In order to prepare a web page that will host a map, you simply need to include the library and its CSS, by putting the following lines in the head section of your document:

<link rel="stylesheet" href="http://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.3/leaflet.css" />
<script src="http://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.3/leaflet.js"></script>

Moreover, we have all our GeoJSON data in a separate file, so we want to load the data dynamically rather than manually put all the points in the map. For this purpose, we can easily play with jQuery, which we also need to include:

<script src="http://code.jquery.com/jquery-2.1.0.min.js"></script>

The map itself will be placed into a div element:

<!-- this goes in the <head> -->
<style>
#map {
    height: 600px;
}
</style>
<!-- this goes in the <body> -->
<div id="map"></div>

We’re now ready to create the map with Leaflet:

// Load the tile images from OpenStreetMap
var mytiles = L.tileLayer('http://{s}.tile.osm.org/{z}/{x}/{y}.png', {
    attribution: '&copy; <a href="http://osm.org/copyright">OpenStreetMap</a> contributors'
});
// Initialise an empty map
var map = L.map('map');
// Read the GeoJSON data with jQuery, and create a circleMarker element for each tweet
// Each tweet will be represented by a nice red dot
$.getJSON("./geo_data.json", function(data) {
    var myStyle = {
        radius: 2,
        fillColor: "red",
        color: "red",
        weight: 1,
        opacity: 1,
        fillOpacity: 1
    };

    var geojson = L.geoJson(data, {
        pointToLayer: function (feature, latlng) {
            return L.circleMarker(latlng, myStyle);
        }
    });
    geojson.addTo(map)
});
// Add the tiles to the map, and initialise the view in the middle of Europe
map.addLayer(mytiles).setView([50.5, 5.0], 5);

A screenshot of the results:

[Screenshot: the geolocated tweets plotted as red dots on OpenStreetMap tiles]

The above example uses OpenStreetMap for the tile images, but Leaflet lets you choose other services. For example, in the following screenshot the tiles are coming from MapBox.

[Screenshot: the same map using MapBox tiles]


Summary

In general there are many options for data visualisation in Python, but in terms of browser-based interaction, Javascript is also an interesting option, and the two languages can play well together. This article has shown that building a simple interactive map is a fairly straightforward process.

With a few lines of Python, we’ve been able to transform our data into a common format (GeoJSON) that can be passed onto Javascript for visualisation. Leaflet.js is a nice Javascript library that, almost out of the box, lets us create some nice interactive maps.


@MarcoBonzanini

Functional Programming in Python

This is probably not the newest of the topics, but I haven’t had the chance to dig into it before, so here we go.

Python supports multiple programming paradigms, but it’s not best known for its Functional Programming style.
As its own creator has mentioned before, Python hasn’t been heavily influenced by other functional languages, but it does show some functional features. This is a very gentle introduction to Functional Programming in Python.

What is Functional Programming anyway?

Functional Programming is a programming paradigm based on the evaluation of expressions, which avoids changing state and mutable data. Popular functional languages include Lisp (and family), ML (and family), Haskell, Erlang and Clojure.

Functional Programming can be seen as a subset of Declarative Programming (as opposed to Imperative Programming).

Some of the characteristics of Functional Programming are:

  • Avoid state representation
  • Data are immutable
  • First-class functions
  • Higher-order functions
  • Recursion

One of the key motivations behind Functional Programming, and behind some of its aspects like no changing state and no mutable data, is the need to eliminate side effects, i.e. changes in the state that don't really depend on the function input, which make the program more difficult to understand and predict.

State representation and Transformation

In functional programming, functions transform input into output, without an intermediate representation of the current state.

Consider the following code:

def square(x):
    return x*x

input = [1, 2, 3, 4]
output = []
for v in input:
    output.append(square(v))

The code takes a list as input and iterates through its individual values applying the function square() to them; the final outcome is stored in the variable output which is initially an empty list, and is updated during the iteration.

Now, let’s consider the following functional style:

def square(x):
    return x*x

input = [1, 2, 3, 4]
output = map(square, input)

The logic is very similar: apply the function square() to all the elements of the list given as input, but there is no internal state representation. The output is also a little bit different, as in Python 3 the map() function returns an iterator rather than a list.

Data are immutable

Related to the point above, the concept of immutable data can be summarised as: functions will return new data as the outcome of a computation, leaving the original input intact.

As mentioned earlier, Python is not a purely functional language, so there are both mutable and immutable data structures, e.g. lists are mutable arrays, while tuples are immutable:

>>> mutable = [1, 2, 3, 4]
>>> mutable[0] = 100
>>> mutable
[100, 2, 3, 4]
>>> mutable.append('hello')
>>> mutable
[100, 2, 3, 4, 'hello']

>>> immutable = (1, 2, 3, 4)
>>> immutable[0] = 100
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> immutable.append('hello')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'
>>> immutable
(1, 2, 3, 4)

First-class functions and Higher-order functions

Functions being first class means that they can appear anywhere in a program, including as return values and arguments of other functions. For example:

def greet(name):
    print("Hello %s!" % name)

say_hello = greet
say_hello('World')
# Hello World!

Higher-order functions are functions which can take functions as arguments (or return them as results). For example, this is what the map() function does, as one of its arguments is the function to apply to a sequence. Notice the difference between the following two lines:

# some_func is an input of other_func()
out = other_func(some_func, some_data)
# the output of some_func() is the input of other_func()
out = other_func(some_func(some_data))

The first case represents a higher-order function.

Recursion

“To understand recursion, you must understand recursion.”


A recursive function calls itself on a smaller input, until the problem is reduced to some base case. For example, the factorial can be calculated in Python as:

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)

Recursion is also a way to achieve iteration, usually in a more compact and elegant way, again without using internal state representation and without modifying the input data.

While it is possible to write recursive functions in Python, its interpreter doesn’t recognise tail recursion, an optimisation technique typically used by functional languages.

What else?

Besides the already mentioned map() function, there is more Functional Programming support in the Python standard library.

reduce() and filter()

The functions reduce() and filter() naturally belong together with the map() function. While map will apply a function over a sequence, producing a sequence as output, reduce will apply a function of two arguments cumulatively to the items in a sequence, in order to reduce the sequence to a single value.

In Python 3, reduce() is no longer a core built-in function and has been moved to functools (still part of the standard library), i.e. in order to use it you need to import it as follows:

from functools import reduce
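
For example, summing a sequence of numbers is a typical reduction with a two-argument function:

from functools import reduce

values = [1, 2, 3, 4]
total = reduce(lambda x, y: x + y, values)
print(total)
# 10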

The use of filter() consists in applying a function to each item in a sequence, building a new sequence with the items for which the function returns true. For example:

def is_odd(n):
    return n%2 == 1

print(is_odd(2))
# False
print(is_odd(3))
# True
values = [1, 2, 3, 4, 5, 6]
result = filter(is_odd, values)
print(list(result))
# [1, 3, 5]

Notice: the term sequence has to be intended as iterable, and for both map() and filter() the return value is an iterator (hence the use of list() to visualise the actual list).

lambda functions

Python supports the definition at runtime of lambda functions, i.e. small anonymous functions not necessarily bound to a name. The use of anonymous functions is fairly common in Functional Programming, and it's easy to see their use related to map/reduce/filter:

words = "The quick brown fox".split()
words_len = map(lambda w: len(w), words)
print(words)
# ['The', 'quick', 'brown', 'fox']
print(list(words_len))
# [3, 5, 5, 3]

Notice: the short lambda function above is just an example; in this case we could use len() directly.

itertools

The itertools module groups a number of functions related to iterators and their use/combination. Most of the basic and not-so-basic functionality that you might need to deal with iterators is already implemented there. The official Python documentation has a nice tutorial on Functional Programming, where the itertools module is described in detail.

Generators for lazy-evaluation

Generator functions and generator expressions in Python provide objects that behave like an iterator (i.e. they can be looped over), without having the full content of the iterable loaded in memory. This concept is linked to lazy-evaluation, i.e. a call-by-need strategy where an item in a sequence is evaluated only when it’s needed, otherwise it’s not loaded in memory. This has repercussions on the amount of memory needed to deal with large sequences.
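
For example, a generator expression describes a long sequence without materialising it: items are computed one at a time, only when the sequence is iterated.

# No list of one million squares is built in memory
squares = (x * x for x in range(10**6))
print(next(squares))  # 0 -- only the first item has been computed so far
print(next(squares))  # 1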

Some Final Thoughts

The topic of Functional Programming definitely deserves more than a blog post.

While Functional Programming has its benefits, Python is not a pure functional language. In particular, the mix of mutable and immutable data structures makes it difficult to truly separate the functional aspects of the language from the non-functional ones. The lack of some optimisations (e.g. no tail recursion) also makes some aspects of Functional Programming particularly costly, raising some questions about efficiency.

Even though Python is not openly inspired by other functional languages, I think it's always important to have an open mind regarding different programming paradigms. From this point of view, knowing what Python can (and can't) do, and how, can only be beneficial for any Python user.

Please leave a comment or share if you liked (or disliked!) the article.

@MarcoBonzanini

Mining Twitter Data with Python (Part 6 – Sentiment Analysis Basics)

Sentiment Analysis is one of the interesting applications of text analytics. Although the term is often associated with sentiment classification of documents, broadly speaking it refers to the use of text analytics approaches applied to the set of problems related to identifying and extracting subjective material in text sources.

This article continues the series on mining Twitter data with Python, describing a simple approach for Sentiment Analysis and applying it to the rugby data set (see Part 4).


A Simple Approach for Sentiment Analysis

The technique we're discussing in this post has been elaborated from the traditional approach proposed by Peter Turney in his paper Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. A lot of work has been done in Sentiment Analysis since then, but the approach still has an interesting educational value. In particular, it is intuitive, simple to understand and to test, and most of all unsupervised, so it doesn't require any labelled data for training.

Firstly, we define the Semantic Orientation (SO) of a word as the difference between its associations with positive and negative words. In practice, we want to calculate “how close” a word is with terms like good and bad. The chosen measure of “closeness” is Pointwise Mutual Information (PMI), calculated as follows (t1 and t2 are terms):

\mbox{PMI}(t_1, t_2) = \log\Bigl(\frac{P(t_1 \wedge t_2)}{P(t_1) \cdot P(t_2)}\Bigr)

In Turney's paper, the SO of a word was calculated against excellent and poor, but of course we can extend the vocabulary of positive and negative terms. Using V^{+} as a vocabulary of positive terms and V^{-} for the negative ones, the Semantic Orientation of a term t is hence defined as:

\mbox{SO}(t) = \sum_{t' \in V^{+}}\mbox{PMI}(t, t') - \sum_{t' \in V^{-}}\mbox{PMI}(t, t')

We can build our own list of positive and negative terms, or we can use one of the many resources available on-line, for example the opinion lexicon by Bing Liu.
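
If you opt for an existing lexicon, loading it boils down to reading a couple of plain-text files. A minimal sketch for the Bing Liu opinion lexicon (the file names and the ';' comment convention refer to the version distributed with the lexicon, so double-check your local copy):

def load_lexicon(fname):
    # lines starting with ';' are comments in the lexicon files
    with open(fname, encoding='latin-1') as f:
        return [line.strip() for line in f
                if line.strip() and not line.startswith(';')]

positive_vocab = load_lexicon('positive-words.txt')
negative_vocab = load_lexicon('negative-words.txt')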

Computing Term Probabilities

In order to compute P(t) (the probability of observing the term t) and P(t_1 \wedge t_2) (the probability of observing the terms t1 and t2 occurring together) we can re-use some previous code to calculate term frequencies and term co-occurrences. Given the set of documents (tweets) D, we define the Document Frequency (DF) of a term as the number of documents where the term occurs. The same definition can be applied to co-occurrent terms. Hence, we can define our probabilities as:

P(t) = \frac{\mbox{DF}(t)}{|D|}\\  P(t_1 \wedge t_2) = \frac{\mbox{DF}(t_1 \wedge t_2)}{|D|}

In the previous articles, the document frequency for single terms was stored in the dictionaries count_single and count_stop_single (the latter doesn't store stop-words), while the document frequency for the co-occurrences was stored in the co-occurrence matrix com.

This is how we can compute the probabilities:

# n_docs is the total n. of tweets
from collections import defaultdict

p_t = {}
p_t_com = defaultdict(lambda : defaultdict(int))

for term, n in count_stop_single.items():
    p_t[term] = n / n_docs
    for t2 in com[term]:
        p_t_com[term][t2] = com[term][t2] / n_docs

Computing the Semantic Orientation

Given two vocabularies for positive and negative terms:

positive_vocab = [
    'good', 'nice', 'great', 'awesome', 'outstanding',
    'fantastic', 'terrific', ':)', ':-)', 'like', 'love',
    # shall we also include game-specific terms?
    # 'triumph', 'triumphal', 'triumphant', 'victory', etc.
]
negative_vocab = [
    'bad', 'terrible', 'crap', 'useless', 'hate', ':(', ':-(',
    # 'defeat', etc.
]

We can compute the PMI of each pair of terms, and then compute the Semantic Orientation as described above:

import math

pmi = defaultdict(lambda : defaultdict(int))
for t1 in p_t:
    for t2 in com[t1]:
        denom = p_t[t1] * p_t[t2]
        pmi[t1][t2] = math.log2(p_t_com[t1][t2] / denom)

semantic_orientation = {}
for term, n in p_t.items():
    positive_assoc = sum(pmi[term][tx] for tx in positive_vocab)
    negative_assoc = sum(pmi[term][tx] for tx in negative_vocab)
    semantic_orientation[term] = positive_assoc - negative_assoc

The Semantic Orientation of a term will have a positive (negative) value if the term is often associated with terms in the positive (negative) vocabulary. The value will be zero for neutral terms, e.g. when the PMIs for positive and negative terms balance out, or more likely when a term is never observed together with other terms in the positive/negative vocabularies.

We can print out the semantic orientation for some terms:

import operator

semantic_sorted = sorted(semantic_orientation.items(),
                         key=operator.itemgetter(1), 
                         reverse=True)
top_pos = semantic_sorted[:10]
top_neg = semantic_sorted[-10:]

print(top_pos)
print(top_neg)
print("ITA v WAL: %f" % semantic_orientation['#itavwal'])
print("SCO v IRE: %f" % semantic_orientation['#scovire'])
print("ENG v FRA: %f" % semantic_orientation['#engvfra'])
print("#ITA: %f" % semantic_orientation['#ita'])
print("#FRA: %f" % semantic_orientation['#fra'])
print("#SCO: %f" % semantic_orientation['#sco'])
print("#ENG: %f" % semantic_orientation['#eng'])
print("#WAL: %f" % semantic_orientation['#wal'])
print("#IRE: %f" % semantic_orientation['#ire'])

Different vocabularies will produce different scores. Using the opinion lexicon from Bing Liu, this is what we can observe on the rugby data-set:

# the top positive terms
[('fantastic', 91.39950482011552), ('@dai_bach', 90.48767241244532), ('hoping', 80.50247748725415), ('#it', 71.28333427277785), ('days', 67.4394844955977), ('@nigelrefowens', 64.86112716005566), ('afternoon', 64.05064208341855), ('breathtaking', 62.86591435212975), ('#wal', 60.07283361352875), ('annual', 58.95378954406133)]
# the top negative terms
[('#england', -74.83306534609066), ('6', -76.0687215594536), ('#itavwal', -78.4558633116863), ('@rbs_6_nations', -80.89363516601993), ("can't", -81.75379628180468), ('like', -83.9319149443813), ('10', -85.93073078165587), ('italy', -86.94465165178258), ('#engvfra', -113.26188957010228), ('ball', -161.82146824640125)]
# Matches
ITA v WAL: -78.455863
SCO v IRE: -73.487661
ENG v FRA: -113.261890
# Individual team
#ITA: 53.033824
#FRA: 14.099372
#SCO: 4.426723
#ENG: -0.462845
#WAL: 60.072834
#IRE: 19.231722

Some Limitations

The PMI-based approach has been introduced as simple and intuitive, but of course it has some limitations. The semantic scores are calculated on terms, meaning that there is no notion of “entity” or “concept” or “event”. For example, it would be nice to aggregate and normalise all the references to the team names, e.g. #ita, Italy and Italia should all contribute to the semantic orientation of the same entity. Moreover, do the opinions on the individual teams also contribute to the overall opinion on a match?

Some aspects of natural language are also not captured by this approach, most notably modifiers and negation: how do we deal with phrases like not bad (this is the opposite of just bad) or very good (this is stronger than just good)?

Summary

This article has continued the tutorial on mining Twitter data with Python introducing a simple approach for Sentiment Analysis, based on the computation of a semantic orientation score which tells us whether a term is more closely related to a positive or negative vocabulary. The intuition behind this approach is fairly simple, and it can be implemented using Pointwise Mutual Information as a measure of association. The approach has of course some limitations, but it’s a good starting point to get familiar with Sentiment Analysis.

@MarcoBonzanini

Getting started with Neo4j and Python

This article is a brief introduction to Neo4j, one of the most popular graph databases, and its integration with Python.

Graph Databases

Graph databases are a family of NoSQL databases, based on the concept of modelling your data as a graph, i.e. a collection of nodes (representing entities) and edges (representing relationships).

The motivation behind the use of a graph database is the need to model small records which are deeply interconnected, forming a complex web that is difficult to represent in a relational fashion. Graph databases are particularly good at supporting queries that actually make use of such connections, i.e. by traversing the graph. Examples of suitable applications include social networks, recommendation engines (e.g. “show me movies that my best friends like”) and many other cases of link-rich domains.

Quick Installation

From the Neo4j web-site, we can download the community edition of Neo4j. At the time of this writing, the latest version is 2.2.0, which provides improved performance and a re-design of the UI. To install the software, simply unzip it:

tar zxf neo4j-community-2.2.0-unix.tar.gz
ln -s neo4j-community-2.2.0 neo4j

We can immediately run the server:

cd neo4j
./bin/neo4j start

and now we can point the browser to http://localhost:7474 for a nice web GUI. The first time you open the interface, you’ll be asked to set a password for the user “neo4j”.

If you want to stop the server, you can type:

./bin/neo4j stop

Interfacing with Python

There is no shortage of Neo4j clients available for several programming languages, including Python. An interesting project, which makes use of the Neo4j REST interface, is Neo4jRestClient. Quick installation:

pip install neo4jrestclient

All the features of this client are listed in the docs.

Creating a sample graph

Let's start with a simple social-network-like application, where users know each other and like different "things". In this example, users and things will be nodes in our database. Each node can be associated with labels, used to describe the type of node. The following code will create two nodes labelled as User and two nodes labelled as Beer:

from neo4jrestclient.client import GraphDatabase

db = GraphDatabase("http://localhost:7474", username="neo4j", password="mypassword")

# Create some nodes with labels
user = db.labels.create("User")
u1 = db.nodes.create(name="Marco")
user.add(u1)
u2 = db.nodes.create(name="Daniela")
user.add(u2)

beer = db.labels.create("Beer")
b1 = db.nodes.create(name="Punk IPA")
b2 = db.nodes.create(name="Hoegaarden Rosee")
# You can associate a label with many nodes in one go
beer.add(b1, b2)

The second step is all about connecting the dots, which in graph DB terminology means creating the relationships.

# User-likes-&amp;gt;Beer relationships
u1.relationships.create("likes", b1)
u1.relationships.create("likes", b2)
u2.relationships.create("likes", b1)
# Bi-directional relationship?
u1.relationships.create("friends", u2)

We notice that relationships have a direction, so we can easily model subject-predicate-object kinds of relationships. In case we need to model a bi-directional relationship, like a friend-of link in a social network, there are essentially two options:

  • Add two edges per relationship, one for each direction
  • Add one edge per relationship, with an arbitrary direction, and then ignore the direction in the query

In this example, we’re following the second option.

Querying the graph

The Neo4j Browser available at http://localhost:7474/ provides a nice way to query the DB and visualise the results, both as a list of record and in a visual form.

The query language for Neo4j is called Cypher. It allows us to describe patterns in graphs, in a declarative fashion, i.e. just like SQL, you describe what you want, rather than how to retrieve it. Cypher uses some sort of ASCII-art to describe nodes, relationships and their direction.

For example, we can retrieve our whole graph using the following Cypher query:

MATCH (n)-[r]-&amp;gt;(m) RETURN n, r, m;

And the outcome in the browser:

[Screenshot: Neo4j Browser, results of a query]

In plain English, what the query is trying to match is “any node n, linked to a node m via a relationship r”. Suggestion: with a huge graph, use a LIMIT clause.

Of course we can also embed Cypher in our Python app, for example:

from neo4jrestclient import client

q = 'MATCH (u:User)-[r:likes]->(m:Beer) WHERE u.name="Marco" RETURN u, type(r), m'
# "db" as defined above
results = db.query(q, returns=(client.Node, str, client.Node))
for r in results:
    print("(%s)-[%s]-&amp;gt;(%s)" % (r[0]["name"], r[1], r[2]["name"]))
# The output:
# (Marco)-[likes]-&amp;gt;(Punk IPA)
# (Marco)-[likes]-&amp;gt;(Hoegaarden Rosee)

The above query will retrieve all the triplets User-likes-Beer for the user Marco. The results variable will be a list of tuples, matching the format that we gave in Cypher with the RETURN keyword.
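
Similarly, the friends relationship we created earlier can be matched regardless of its direction by simply omitting the arrow in the Cypher pattern. A minimal sketch, reusing the same db object:

# Omitting the arrow matches the relationship in either direction
q = 'MATCH (u1:User)-[r:friends]-(u2:User) WHERE u1.name="Marco" RETURN u1, type(r), u2'
results = db.query(q, returns=(client.Node, str, client.Node))
for r in results:
    print("(%s)-[%s]-(%s)" % (r[0]["name"], r[1], r[2]["name"]))
# (Marco)-[friends]-(Daniela)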

Summary

Graph databases, one of the NoSQL flavours, provide an interesting way to model data with rich interconnections. Examples of applications that are particularly suitable for graph databases are social networks and recommendation systems. This article has introduced Neo4j, one of the main examples of Graph DB, and its use with Python using the Neo4j REST client. We have seen how to create nodes and relationships, and how to query the graph using Cypher, the Neo4j query language.

Mining Twitter Data with Python: Part 5 – Data Visualisation Basics

A picture is worth a thousand tweets: more often than not, designing a good visual representation of our data can help us make sense of them and highlight interesting insights. After collecting and analysing Twitter data, the tutorial continues with some notions of data visualisation with Python.

From Python to Javascript with Vincent

While there are some options to create plots in Python using libraries like matplotlib or ggplot, one of the coolest libraries for data visualisation is probably D3.js which is, as the name suggests, based on Javascript. D3 plays well with web standards like CSS and SVG, and allows us to create some wonderful interactive visualisations.

Vincent bridges the gap between a Python back-end and a front-end that supports D3.js visualisation, allowing us to benefit from both sides. The tagline of Vincent is in fact “The data capabilities of Python. The visualization capabilities of JavaScript”. Vincent, a Python library, takes our data in Python format and translates them into Vega, a JSON-based visualisation grammar that will be used on top of D3. It sounds quite complicated, but it’s fairly simple and pythonic. You don’t have to write a line in Javascript/D3 if you don’t want to.

Firstly, let’s install Vincent:

pip install vincent

Secondly, let’s create our first plot. Using the list of most frequent terms (without hashtags) from our rugby data set, we want to plot their frequencies:

import vincent

word_freq = count_terms_only.most_common(20)
labels, freq = zip(*word_freq)
data = {'data': freq, 'x': labels}
bar = vincent.Bar(data, iter_idx='x')
bar.to_json('term_freq.json')

At this point, the file term_freq.json will contain a description of the plot that can be handed over to D3.js and Vega. A simple template (taken from Vincent resources) to visualise the plot:

<html>
<head>
    <title>Vega Scaffold</title>
    <script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>
    <script src="http://d3js.org/topojson.v1.min.js"></script>
    <script src="http://d3js.org/d3.geo.projection.v0.min.js" charset="utf-8"></script>
    <script src="http://trifacta.github.com/vega/vega.js"></script>
</head>
<body>
    <div id="vis"></div>
</body>
<script type="text/javascript">
// parse a spec and create a visualization view
function parse(spec) {
  vg.parse.spec(spec, function(chart) { chart({el:"#vis"}).update(); });
}
parse("term_freq.json");
</script>
</html>

Save the above HTML page as chart.html and run the simple Python web server:

python -m http.server 8888 # Python 3
python -m SimpleHTTPServer 8888 # Python 2

Now you can open your browser at http://localhost:8888/chart.html and observe the result:

Term Frequencies

Notice: you could save the HTML template directly from Python with:

bar.to_json('term_freq.json', html_out=True, html_path='chart.html')

but, at least in Python 3, the output is not well-formed HTML and you’d need to manually strip some characters.

With this procedure, we can plot many different types of charts with Vincent. Let’s take a moment to browse the docs and see its capabilities.

Time Series Visualisation

Another interesting aspect of analysing data from Twitter is the possibility to observe the distribution of tweets over time. In other words, if we organise the frequencies into temporal buckets, we could observe how Twitter users react to real-time events.

One of my favourite tools for data analysis with Python is Pandas, which also has fairly decent support for time series. As an example, let’s track the hashtag #ITAvWAL to observe what happened during the first match.

Firstly, if we haven’t done it yet, we need to install Pandas:

pip install pandas

In the main loop which reads all the tweets, we simply track the occurrences of the hashtag, i.e. we can refactor the code from the previous episodes into something similar to:

import pandas
import json

dates_ITAvWAL = []
# f is the file pointer to the JSON data set
for line in f:
    tweet = json.loads(line)
    # let's focus on hashtags only at the moment
    terms_hash = [term for term in preprocess(tweet['text']) if term.startswith('#')]
    # track when the hashtag is mentioned
    if '#itavwal' in terms_hash:
        dates_ITAvWAL.append(tweet['created_at'])

# a list of "1" to count the hashtags
ones = [1]*len(dates_ITAvWAL)
# the index of the series
idx = pandas.DatetimeIndex(dates_ITAvWAL)
# the actual series (a series of 1s for the moment)
ITAvWAL = pandas.Series(ones, index=idx)

# Resampling / bucketing
per_minute = ITAvWAL.resample('1Min', how='sum').fillna(0)

The last line is what allows us to track the frequencies over time. The series is re-sampled with intervals of 1 minute. This means all the tweets falling within a particular minute will be aggregated, more precisely they will be summed up, given how='sum'. The time index will not keep track of the seconds anymore. If there is no tweet in a particular minute, the fillna() function will fill the blanks with zeros.
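Note that the how= argument of resample() has been deprecated in more recent versions of pandas (0.18 onwards); if you are on a newer release, the equivalent is to chain the aggregation, e.g.:

per_minute = ITAvWAL.resample('1Min').sum().fillna(0)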

To put the time series in a plot with Vincent:

time_chart = vincent.Line(ITAvWAL)
time_chart.axis_titles(x='Time', y='Freq')
time_chart.to_json('time_chart.json')

Once you embed the time_chart.json file into the HTML template discussed above, you’ll see this output:

Time Series

The interesting moments of the match are observable from the spikes in the series. The first spike just before 1pm corresponds to the first Italian try. All the other spikes between 1:30 and 2:30pm correspond to Welsh tries and show the Welsh dominance during the second half. The match was over by 2:30, so after that Twitter went quiet.

Rather than just observing one sequence at a time, we could compare different series to observe how the matches have evolved. So let’s refactor the code for the time series, keeping track of the three different hashtags #ITAvWAL, #SCOvIRE and #ENGvFRA in the corresponding pandas.Series, as sketched below.
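The refactoring itself is straightforward; a minimal sketch could keep one list of timestamps per hashtag and build a series of 1s for each (per_minute_i, per_minute_s and per_minute_e are the names used in the snippet that follows; the resampling happens on the DataFrame below):

# one list of timestamps per hashtag, filled in the main loop as before, e.g.
#     if '#itavwal' in terms_hash: dates_ITAvWAL.append(tweet['created_at'])
#     if '#scovire' in terms_hash: dates_SCOvIRE.append(tweet['created_at'])
#     if '#engvfra' in terms_hash: dates_ENGvFRA.append(tweet['created_at'])

def to_series(dates):
    idx = pandas.DatetimeIndex(dates)
    return pandas.Series([1]*len(dates), index=idx)

per_minute_i = to_series(dates_ITAvWAL)
per_minute_s = to_series(dates_SCOvIRE)
per_minute_e = to_series(dates_ENGvFRA)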

# all the data together
match_data = dict(ITAvWAL=per_minute_i, SCOvIRE=per_minute_s, ENGvFRA=per_minute_e)
# we need a DataFrame, to accommodate multiple series
all_matches = pandas.DataFrame(data=match_data,
                               index=per_minute_i.index)
# Resampling as above
all_matches = all_matches.resample('1Min', how='sum').fillna(0)

# and now the plotting
time_chart = vincent.Line(all_matches[['ITAvWAL', 'SCOvIRE', 'ENGvFRA']])
time_chart.axis_titles(x='Time', y='Freq')
time_chart.legend(title='Matches')
time_chart.to_json('time_chart.json')

And the output:

Time series of the three matches

We can immediately observe when the different matches took place (approx. 12:30-2:30, 2:30-4:30 and 5-7) and we can see how the last match got all the attention, especially at the end when the winner was revealed.

Summary

Data visualisation is an important discipline in the bigger context of data analysis. By supporting visual representations of our data, we can provide interesting insights. We have discussed a relatively simple option to support data visualisation with Python using Vincent. In particular, we have seen how we can easily bridge the gap between Python and a language like Javascript that offers a great tool like D3.js, one of the most important libraries for interactive visualisation. Overall, we have just scratched the surface of data visualisation, but as a starting point this should be enough to get some nice ideas going. The nature of Twitter as a medium has also encouraged a quick look into the topic of time series analysis, allowing us to mention pandas as a great Python tool.

If this article has given you some ideas for data visualisation, please leave a comment below or get in touch.

@MarcoBonzanini

Mining Twitter Data with Python (Part 4: Rugby and Term Co-occurrences)

Last Saturday was the closing day of the Six Nations Championship, an annual international rugby competition. Before turning on the TV to watch Italy being thrashed by Wales, I decided to use this event to collect some data from Twitter and perform some exploratory text analysis on something more interesting than the small list of my tweets.

This article continues the tutorial on Twitter Data Mining, re-using what we discussed in the previous articles with some more realistic data. It also expands the analysis by introducing the concept of term co-occurrence.

The Application Domain

As the name suggests, six teams are involved in the competition: England, Ireland, Wales, Scotland, France and Italy. This means that we can expect the event to be tweeted in multiple languages (English, French, Italian, Welsh, Gaelic, possibly other languages as well), with English being the major language. Assuming the team names will be mentioned frequently, we could decide to look also for their nicknames, e.g. Les Bleus for France or Azzurri for Italy. During the last day of the competition, three matches are played sequentially. Three teams in particular had a shot for the title: England, Ireland and Wales. At the end, Ireland won the competition but everything was open until the very last minute.

Setting Up

I used the streaming API to download all the tweets containing the string #rbs6nations during the day. Obviously not all the tweets about the event contained the hashtag, but this is a good baseline. The time frame for the download was from around 12:15PM to 7:15PM GMT, that is from about 15 minutes before the first match to about 15 minutes after the last match was over. In the end, more than 18,000 tweets were downloaded in JSON format, making for about 75MB of data. This should be small enough to quickly do some processing in memory, and at the same time big enough to observe something possibly interesting.

The textual content of the tweets has been pre-processed with tokenisation and lowercasing using the preprocess() function introduced in Part 2 of the tutorial.

Interesting terms and hashtags

Following what we discussed in Part 3 (Term Frequencies), we want to observe the most common terms and hashtags used during the day. If you have followed the discussion about creating different lists of tokens in order to capture terms without hashtags, hashtags only, removing stop-words, etc. you can play around with the different lists.

This is the unsurprising list of top 10 most frequent terms (terms_only in Part 3) in the data set.

[('ireland', 3163), ('england', 2584), ('wales', 2271), ('…', 2068), ('day', 1479), ('france', 1380), ('win', 1338), ('rugby', 1253), ('points', 1221), ('title', 1180)]

The first three terms correspond to the teams who had a go for the title. The frequencies also respect the order in the final table. The fourth term is instead a punctuation mark that we missed and didn’t include in the list of stop-words. This is because string.punctuation only contains ASCII symbols, while here we’re dealing with a unicode character. If we dig into the data, there will be more examples like this, but for the moment we don’t worry about it.
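One quick fix, assuming the stop list from Part 3 is a plain Python list, is to append the offending character explicitly:

# the horizontal ellipsis is a single unicode character, not covered by string.punctuation
stop = stop + ['…']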

After adding the suspension-points symbol to the list of stop-words, we have a new entry at the end of the list:

[('ireland', 3163), ('england', 2584), ('wales', 2271), ('day', 1479), ('france', 1380), ('win', 1338), ('rugby', 1253), ('points', 1221), ('title', 1180), ('🍀', 1154)]

Interestingly, there is a new token we didn’t account for: an emoji symbol (in this case, the Irish shamrock).

If we have a look at the most common hashtags, we need to consider that #rbs6nations will be by far the most common token (that’s our search term for downloading the tweets), so we can exclude it from the list. This leaves us with:

[('#engvfra', 1701), ('#itavwal', 927), ('#rugby', 880), ('#scovire', 692), ('#ireland', 686), ('#angfra', 554), ('#xvdefrance', 508), ('#crunch', 500), ('#wales', 446), ('#england', 406)]

We can observe that the most common hashtags, apart from #rugby, are related to the individual matches. In particular, England v France has received the highest number of mentions, probably being the last match of the day with a dramatic finale. Something interesting to notice is that a fair amount of tweets also contained terms in French: the count for #angfra should in fact be added to #engvfra. Those unfamiliar with rugby probably wouldn’t recognise that #crunch should also be counted towards the #engvfra match, as Le Crunch is the traditional name for this event. So, by far, the last match has received a lot of attention.
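To get a combined figure for that match we can simply sum the individual counts; a quick sketch, assuming count_only_hash is the hashtag Counter built in Part 3 (the variable name is an assumption, adapt it to your own code):

engvfra_total = count_only_hash['#engvfra'] + count_only_hash['#angfra'] + count_only_hash['#crunch']
print(engvfra_total)  # 1701 + 554 + 500 = 2755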

Term co-occurrences

Sometimes we are interested in the terms that occur together. This is mainly because the context gives us a better insight about the meaning of a term, supporting applications such as word disambiguation or semantic similarity. We discussed the option of using bigrams in the previous article, but we want to extend the context of a term to the whole tweet.

We can refactor the code from the previous article in order to capture the co-occurrences. We build a co-occurrence matrix com such that com[x][y] contains the number of times the term x has been seen in the same tweet as the term y:

from collections import defaultdict
import json
# remember to include the other imports (preprocess, stop list) from the previous posts

com = defaultdict(lambda : defaultdict(int))

# f is the file pointer to the JSON data set
for line in f: 
    tweet = json.loads(line)
    terms_only = [term for term in preprocess(tweet['text']) 
                  if term not in stop 
                  and not term.startswith(('#', '@'))]

    # Build co-occurrence matrix
    for i in range(len(terms_only)-1):            
        for j in range(i+1, len(terms_only)):
            w1, w2 = sorted([terms_only[i], terms_only[j]])                
            if w1 != w2:
                com[w1][w2] += 1

While building the co-occurrence matrix, we don’t want to count the same term pair twice, e.g. com[A][B] == com[B][A], so the inner for loop starts from i+1 in order to build a triangular matrix, while sorted will preserve the alphabetical order of the terms.
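As a quick check, individual pair counts can be read back by looking up the alphabetically sorted pair:

w1, w2 = sorted(['wales', 'ireland'])
print(com[w1][w2])  # number of tweets where both terms appear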

For each term, we then extract the 5 most frequent co-occurrent terms, creating a list of tuples in the form ((term1, term2), count):

import operator

com_max = []
# For each term, look for the most common co-occurrent terms
for t1 in com:
    t1_max_terms = sorted(com[t1].items(), key=operator.itemgetter(1), reverse=True)[:5]
    for t2, t2_count in t1_max_terms:
        com_max.append(((t1, t2), t2_count))
# Get the most frequent co-occurrences
terms_max = sorted(com_max, key=operator.itemgetter(1), reverse=True)
print(terms_max[:5])

The results:

[(('6', 'nations'), 845), (('champions', 'ireland'), 760), (('nations', 'rbs'), 742), (('day', 'ireland'), 731), (('ireland', 'wales'), 674)]

This implementation is pretty straightforward, but depending on the data set and on the use of the matrix, one might want to look into tools like scipy.sparse for building a sparse matrix.
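As a rough sketch of that direction (not needed for this small data set), the nested dictionary could be mapped onto a scipy sparse matrix via a term-to-index vocabulary:

from scipy.sparse import lil_matrix

# collect all terms appearing in the co-occurrence dictionary
terms = set(com)
for w1 in com:
    terms.update(com[w1])
vocab = {term: i for i, term in enumerate(sorted(terms))}

# build the (triangular) sparse matrix of pair counts
m = lil_matrix((len(vocab), len(vocab)))
for w1 in com:
    for w2, count in com[w1].items():
        m[vocab[w1], vocab[w2]] = count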

We could also look for a specific term and extract its most frequent co-occurrences. We simply need to modify the main loop including an extra counter, for example:

import sys
from collections import Counter

search_word = sys.argv[1] # pass a term as a command-line argument
count_search = Counter()
for line in f:
    tweet = json.loads(line)
    terms_only = [term for term in preprocess(tweet['text']) 
                  if term not in stop 
                  and not term.startswith(('#', '@'))]
    if search_word in terms_only:
        count_search.update(terms_only)
print("Co-occurrence for %s:" % search_word)
print(count_search.most_common(20))

The outcome for “ireland”:

[('champions', 756), ('day', 727), ('nations', 659), ('wales', 654), ('2015', 638), ('6', 613), ('rbs', 585), ('http://t.co/y0nvsvayln', 559), ('🍀', 526), ('10', 522), ('win', 377), ('england', 377), ('twickenham', 361), ('40', 360), ('points', 356), ('sco', 355), ('ire', 355), ('title', 346), ('scotland', 301), ('turn', 295)]

The outcome for “rugby”:

[('day', 476), ('game', 160), ('ireland', 143), ('england', 132), ('great', 105), ('today', 104), ('best', 97), ('well', 90), ('ever', 89), ('incredible', 87), ('amazing', 84), ('done', 82), ('amp', 71), ('games', 66), ('points', 64), ('monumental', 58), ('strap', 56), ('world', 55), ('team', 55), ('http://t.co/bhmeorr19i', 53)]

Overall, quite interesting.

Summary

This article has discussed a toy example of Text Mining on Twitter, using some realistic data taken during a sport event. Using what we have learnt in the previous episodes, we have downloaded some data using the streaming API, pre-processed the data in JSON format and extracted some interesting terms and hashtags from the tweets. The article has also introduced the concept of term co-occurrence, shown how to build a co-occurrence matrix and discussed how to use it to find some interesting insights.

@MarcoBonzanini