Some Thoughts on IWCS 2015

Last week I attended the 11th International Conference on Computational Semantics (IWCS 2015). The conference is the biennial meeting of the ACL's Special Interest Group on Semantics, and this edition was hosted by Queen Mary University's Computational Linguistics Lab. The topics discussed at the conference revolve around computing, annotating, extracting and representing meaning in natural language. The format consisted of a first day of workshops (I attended Advances in Distributional Semantics, ADS) followed by three days of the main conference.

It was nice to be back at Queen Mary, where I studied for my MSc and PhD in Information Retrieval, and it was nice to have the opportunity to mix with a different academic crowd. In fact, apart from the local organisers, I hadn't met any of the attendees before, and I only knew a couple of famous names. In particular, Hinrich Schuetze (probably best known for co-authoring a book on Natural Language Processing and one on Information Retrieval) gave a talk at the ADS workshop about The case against compositionality, and Yoshua Bengio (one of the most influential figures in Deep Learning) gave one of the keynote speeches, about Deep Learning of Semantic Representations.

To confirm a feeling I already had, I have to say that small, single-track conferences are in general more enjoyable than huge ones. You might not get an open-bar reception in a fancy five-star hotel, but the networking is much more relaxed, people are generally more approachable, and Q&A sessions are usually spot on. Of course it really depends on the venue, but that's what my not-statistically-significant experience tells me. Moreover, at bigger venues a lot of attention goes to improving on some baseline by 0.1% in accuracy (or whatever the metric is), without many details on the theoretical foundations (with exceptions, of course). Smaller venues usually have the chance to dig deeper into what really makes a model interesting, even when the results are less solid or the evaluation is on a smallish scale.

Speaking of evaluation, this was in my eyes the biggest difference from Information Retrieval conferences: scalability and large-scale evaluation were rarely, if ever, mentioned. I understand that other venues like EMNLP are probably more suitable for these topics, but it was something I noticed.

In general, it's difficult to single out one particular talk, as they were all more or less interesting in my eyes, but one quote that stood out for me was an answer given by Prof. Bengio at the end of his keynote, regarding negation and quantification and how a Neural Network model deals with them: "I don't know. But it learns to do what it needs to do."

As a final side-note, the social event/dinner was a boat trip on the Thames: looking at some well-known London landmarks from a different point of view was absolutely amazing. Well done to the organisers!

Getting started with Neo4j and Python

This article is a brief introduction to Neo4j, one of the most popular graph databases, and its integration with Python.

Graph Databases

Graph databases are a family of NoSQL databases, based on the concept of modelling your data as a graph, i.e. a collection of nodes (representing entities) and edges (representing relationships).

The motivation behind the use of a graph database is the need to model small records which are deeply interconnected, forming a complex web that is difficult to represent in a relational fashion. Graph databases are particularly good at supporting queries that actually make use of such connections, i.e. by traversing the graph. Examples of suitable applications include social networks, recommendation engines (e.g. “show me movies that my best friends like”) and many other cases of link-rich domains.
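To give a flavour of what such a traversal looks like, here is a sketch of the "movies that my friends like" query in Cypher, the query language we will meet later in this article; the labels, relationship types and name used here (Person, Movie, friend, likes, "Marco") are purely illustrative:

// hypothetical schema: (Person)-[:friend]->(Person), (Person)-[:likes]->(Movie)
MATCH (me:Person {name: "Marco"})-[:friend]->(f:Person)-[:likes]->(movie:Movie)
RETURN movie.title, count(f) AS friends_who_like
ORDER BY friends_who_like DESC;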

Quick Installation

From the Neo4j website, we can download the community edition of Neo4j. At the time of writing, the latest version is 2.2.0, which provides improved performance and a redesigned UI. To install the software, simply unpack the archive:

tar zxf neo4j-community-2.2.0-unix.tar.gz
ln -s neo4j-community-2.2.0 neo4j

We can immediately run the server:

cd neo4j
./bin/neo4j start

and now we can point the browser to http://localhost:7474 for a nice web GUI. The first time you open the interface, you’ll be asked to set a password for the user “neo4j”.

If you want to stop the server, you can type:

./bin/neo4j stop

Interfacing with Python

There is no shortage of Neo4j clients available for several programming languages, including Python. An interesting project, which makes use of the Neo4j REST interface, is Neo4jRestClient. Quick installation:

pip install neo4jrestclient

All the features of this client are listed in the docs.

Creating a sample graph

Let's start with a simple social-network-like application, where users know each other and like different "things". In this example, users and things will be nodes in our database. Each node can be associated with labels, used to describe the type of the node. The following code will create two nodes labelled as User and two nodes labelled as Beer:

from neo4jrestclient.client import GraphDatabase

db = GraphDatabase("http://localhost:7474", username="neo4j", password="mypassword")

# Create some nodes with labels
user = db.labels.create("User")
u1 = db.nodes.create(name="Marco")
user.add(u1)
u2 = db.nodes.create(name="Daniela")
user.add(u2)

beer = db.labels.create("Beer")
b1 = db.nodes.create(name="Punk IPA")
b2 = db.nodes.create(name="Hoegaarden Rosee")
# You can associate a label with many nodes in one go
beer.add(b1, b2)

The second step is all about connecting the dots, which in graph DB terminology means creating the relationships.

# User-likes->Beer relationships
u1.relationships.create("likes", b1)
u1.relationships.create("likes", b2)
u2.relationships.create("likes", b1)
# Bi-directional relationship?
u1.relationships.create("friends", u2)

Notice that relationships have a direction, so we can easily model subject-predicate-object kinds of relationships. In case we need to model a bi-directional relationship, like a friend-of link in a social network, there are essentially two options:

  • Add two edges per relationship, one for each direction
  • Add one edge per relationship, with an arbitrary direction, and then ignore the direction in the query

In this example, we're following the second option; a query that ignores the direction is shown at the end of the next section.

Querying the graph

The Neo4j Browser available at http://localhost:7474/ provides a nice way to query the DB and visualise the results, both as a list of records and in visual form.

The query language for Neo4j is called Cypher. It lets you describe patterns in graphs in a declarative fashion, i.e. just like SQL, you describe what you want rather than how to retrieve it. Cypher uses a sort of ASCII art to describe nodes, relationships and their direction.

For example, we can retrieve our whole graph using the following Cypher query:

MATCH (n)-[r]->(m) RETURN n, r, m;

And the outcome in the browser:

Neo4j Browser: results of the query

In plain English, what the query is trying to match is "any node n, linked to a node m via a relationship r". Suggestion: with a huge graph, use a LIMIT clause.
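For instance, to cap the result set instead of pulling back the whole graph:

MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 25;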

Of course we can also embed Cypher in our Python app, for example:

from neo4jrestclient import client

q = 'MATCH (u:User)-[r:likes]->(m:Beer) WHERE u.name="Marco" RETURN u, type(r), m'
# "db" as defined above
results = db.query(q, returns=(client.Node, str, client.Node))
for r in results:
    print("(%s)-[%s]->(%s)" % (r[0]["name"], r[1], r[2]["name"]))
# The output:
# (Marco)-[likes]->(Punk IPA)
# (Marco)-[likes]->(Hoegaarden Rosee)

The above query will retrieve all the triplets User-likes-Beer for the user Marco. The results variable will be a list of tuples, matching the format that we gave in Cypher with the RETURN keyword.
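Coming back to the bi-directional friends relationship: since we stored a single edge with an arbitrary direction, the Cypher pattern simply omits the arrow, so it matches the relationship in either direction. A minimal sketch, reusing the db object defined above (the expected output is based on the sample data created earlier):

# no arrow in the pattern: match "friends" in either direction
q = 'MATCH (u:User)-[:friends]-(f:User) WHERE u.name="Marco" RETURN f.name'
results = db.query(q, returns=(str,))
for r in results:
    print(r[0])
# Expected output:
# Daniela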

Summary

Graph databases, one of the NoSQL flavours, provide an interesting way to model data with rich interconnections. Examples of applications that are particularly suitable for graph databases are social networks and recommendation systems. This article has introduced Neo4j, one of the main examples of Graph DB, and its use with Python using the Neo4j REST client. We have seen how to create nodes and relationships, and how to query the graph using Cypher, the Neo4j query language.

Mining Twitter Data with Python: Part 5 – Data Visualisation Basics

A picture is worth a thousand tweets: more often than not, designing a good visual representation of our data can help us make sense of it and highlight interesting insights. After collecting and analysing Twitter data, the tutorial continues with some notions on data visualisation with Python.

From Python to Javascript with Vincent

While there are several options for creating plots in Python using libraries like matplotlib or ggplot, one of the coolest libraries for data visualisation is probably D3.js which, as the name suggests, is based on Javascript. D3 plays well with web standards like CSS and SVG, and allows you to create some wonderful interactive visualisations.

Vincent bridges the gap between a Python back-end and a front-end that supports D3.js visualisation, allowing us to benefit from both sides. The tagline of Vincent is in fact "The data capabilities of Python. The visualization capabilities of JavaScript". Vincent, a Python library, takes our data in Python format and translates it into Vega, a JSON-based visualisation grammar that sits on top of D3. It sounds quite complicated, but it's fairly simple and pythonic. You don't have to write a line of Javascript/D3 if you don't want to.

Firstly, let’s install Vincent:

pip install vincent

Secondly, let’s create our first plot. Using the list of most frequent terms (without hashtags) from our rugby data set, we want to plot their frequencies:

import vincent

# count_terms_only is the Counter of single terms (no hashtags) built in the previous episode
word_freq = count_terms_only.most_common(20)
labels, freq = zip(*word_freq)
data = {'data': freq, 'x': labels}
bar = vincent.Bar(data, iter_idx='x')
bar.to_json('term_freq.json')

At this point, the file term_freq.json will contain a description of the plot that can be handed over to D3.js and Vega. A simple template (taken from Vincent resources) to visualise the plot:

<html>
<head>
    <title>Vega Scaffold</title>
    <script src="http://d3js.org/d3.v3.min.js" charset="utf-8"></script>
    <script src="http://d3js.org/topojson.v1.min.js"></script>
    <script src="http://d3js.org/d3.geo.projection.v0.min.js" charset="utf-8"></script>
    <script src="http://trifacta.github.com/vega/vega.js"></script>
</head>
<body>
    <div id="vis"></div>
</body>
<script type="text/javascript">
// parse a spec and create a visualization view
function parse(spec) {
  vg.parse.spec(spec, function(chart) { chart({el:"#vis"}).update(); });
}
parse("term_freq.json");
</script>
</html>

Save the above HTML page as chart.html and run the simple Python web server:

python -m http.server 8888 # Python 3
python -m SimpleHTTPServer 8888 # Python 2

Now you can open your browser at http://localhost:8888/chart.html and observe the result:

Term Frequencies

Notice: you could save the HTML template directly from Python with:

bar.to_json('term_freq.json', html_out=True, html_path='chart.html')

but, at least in Python 3, the output is not well-formed HTML and you'd need to manually strip some characters.

With this procedure, we can plot many different types of charts with Vincent. Let’s take a moment to browse the docs and see its capabilities.

Time Series Visualisation

Another interesting aspect of analysing data from Twitter is the possibility of observing the distribution of tweets over time. In other words, if we organise the frequencies into temporal buckets, we can observe how Twitter users react to real-time events.

One of my favourite tools for data analysis with Python is Pandas, which also has fairly decent support for time series. As an example, let's track the hashtag #ITAvWAL to observe what happened during the first match.

Firstly, if we haven’t done it yet, we need to install Pandas:

pip install pandas

In the main loop which reads all the tweets, we simply track the occurrences of the hashtag, i.e. we can refactor the code from the previous episodes into something similar to:

import pandas
import json

dates_ITAvWAL = []
# f is the file pointer to the JSON data set
for line in f:
    tweet = json.loads(line)
    # let's focus on hashtags only at the moment
    terms_hash = [term for term in preprocess(tweet['text']) if term.startswith('#')]
    # track when the hashtag is mentioned
    if '#itavwal' in terms_hash:
        dates_ITAvWAL.append(tweet['created_at'])

# a list of "1" to count the hashtags
ones = [1]*len(dates_ITAvWAL)
# the index of the series
idx = pandas.DatetimeIndex(dates_ITAvWAL)
# the actual series (a series of 1s for the moment)
ITAvWAL = pandas.Series(ones, index=idx)

# Resampling / bucketing
per_minute = ITAvWAL.resample('1Min', how='sum').fillna(0)

The last line is what allows us to track the frequencies over time. The series is re-sampled with intervals of 1 minute. This means all the tweets falling within a particular minute will be aggregated, more precisely they will be summed up, given how='sum'. The time index will not keep track of the seconds anymore. If there is no tweet in a particular minute, the fillna() function will fill the blanks with zeros.
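As a quick, made-up illustration of the bucketing (the timestamps are invented, and this uses the same 2015-era pandas API as above; recent pandas versions would use .resample('1Min').sum() instead of the how= argument):

import pandas

# three hypothetical tweets, two of them within the same minute
idx = pandas.DatetimeIndex(['2015-02-07 12:42:17',
                            '2015-02-07 12:42:39',
                            '2015-02-07 12:44:03'])
series = pandas.Series([1, 1, 1], index=idx)
print(series.resample('1Min', how='sum').fillna(0))
# roughly:
# 2015-02-07 12:42:00    2.0
# 2015-02-07 12:43:00    0.0
# 2015-02-07 12:44:00    1.0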

To put the time series in a plot with Vincent:

time_chart = vincent.Line(ITAvWAL)
time_chart.axis_titles(x='Time', y='Freq')
time_chart.to_json('time_chart.json')

Once you embed the time_chart.json file into the HTML template discussed above, you’ll see this output:

Time Series

The interesting moments of the match are observable from the spikes in the series. The first spike just before 1pm corresponds to the first Italian try. All the other spikes between 1:30 and 2:30pm correspond to Welsh tries and show the Welsh dominance during the second half. The match was over by 2:30, so after that Twitter went quiet.

Rather than just observing one sequence at a time, we could compare different series to observe how the matches evolved. So let's refactor the time series code, keeping track of the three different hashtags #ITAvWAL, #SCOvIRE and #ENGvFRA in three corresponding pandas.Series.
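A minimal sketch of how the three per-minute series could be built, following the same pattern as above; the helper function and the variable names dates_SCOvIRE and dates_ENGvFRA are illustrative (in the main loop you would simply collect one list of timestamps per hashtag):

# dates_SCOvIRE and dates_ENGvFRA are collected in the main loop
# exactly like dates_ITAvWAL, one list of timestamps per hashtag
def per_minute_series(dates):
    # a series of 1s indexed by tweet time, bucketed per minute
    idx = pandas.DatetimeIndex(dates)
    return pandas.Series([1]*len(dates), index=idx).resample('1Min', how='sum').fillna(0)

per_minute_i = per_minute_series(dates_ITAvWAL)
per_minute_s = per_minute_series(dates_SCOvIRE)
per_minute_e = per_minute_series(dates_ENGvFRA)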

# all the data together
match_data = dict(ITAvWAL=per_minute_i, SCOvIRE=per_minute_s, ENGvFRA=per_minute_e)
# we need a DataFrame, to accommodate multiple series
all_matches = pandas.DataFrame(data=match_data,
                               index=per_minute_i.index)
# Resampling as above
all_matches = all_matches.resample('1Min', how='sum').fillna(0)

# and now the plotting
time_chart = vincent.Line(all_matches[['ITAvWAL', 'SCOvIRE', 'ENGvFRA']])
time_chart.axis_titles(x='Time', y='Freq')
time_chart.legend(title='Matches')
time_chart.to_json('time_chart.json')

And the output:

Time series comparison of the three matches

We can immediately observe when the different matches took place (approximately 12:30-2:30, 2:30-4:30 and 5-7) and we can see how the last match got all the attention, especially towards the end when the winner was revealed.

Summary

Data visualisation is an important discipline in the bigger context of data analysis. By supporting visual representations of our data, we can provide interesting insights. We have discussed a relatively simple option to support data visualisation with Python using Vincent. In particular, we have seen how we can easily bridge the gap between Python and a language like Javascript that offers a great tool like D3.js, one of the most important libraries for interactive visualisation. Overall, we have just scratched the surface of data visualisation, but as a starting point this should be enough to get some nice ideas going. The nature of Twitter as a medium has also encouraged a quick look into the topic of time series analysis, allowing us to mention pandas as a great Python tool.

If this article has given you some ideas for data visualisation, please leave a comment below or get in touch.

@MarcoBonzanini
