Last week I attended the 11th International Conference on Computational Semantics (IWCS 2015). This conference is the biennial meeting of the ACL's Special Interest Group on Semantics, and this edition was hosted by Queen Mary University's Computational Linguistics Lab. The topics discussed at the conference revolve around computing, annotating, extracting and representing meaning in natural language. The format of the conference consisted of a first day of workshops (I attended Advances in Distributional Semantics, ADS) followed by three days of the main event.
It was nice to be back at Queen Mary, where I studied for my MSc and PhD in Information Retrieval, and it was nice to have the opportunity to mix with a different academic crowd. In fact, apart from the local organisers, I hadn't met any of the attendees before, and I only knew a couple of famous names. In particular, Hinrich Schuetze (probably best known for co-authoring a book on Natural Language Processing and one on Information Retrieval) gave a talk at the ADS workshop about The case against compositionality, and Yoshua Bengio (one of the most influential figures in Deep Learning) gave one of the keynote speeches, about Deep Learning of Semantic Representations.
To confirm a feeling I already had, I have to say that small, single-track conferences are in general more enjoyable than huge ones. You might not have an open bar reception in a fancy 5-star hotel, but the networking is much more relaxed, people are in general more approachable, and Q&A sessions are usually spot on. Of course it really depends on the venue, but my non-statistically-significant experience tells me so. Moreover, at bigger venues a lot of attention goes to improving some baseline by some 0.1% in accuracy (or whatever the metric), without many details on the theoretical foundations (with exceptions, of course). Smaller venues usually have the chance to dig deeper into what it is that really makes a model interesting, even when the results are less solid or the evaluation is on a smallish scale.
Speaking of evaluation, this was to my eyes the biggest difference from Information Retrieval conferences: scalability and large-scale evaluation were rarely, if ever, mentioned. I understand that other venues like EMNLP are probably more suitable for these topics, but it was something that I noticed.
In general, it’s difficult to mention one particular talk, as they were all more or less interesting in my eyes, but one quote that stood out for me was an answer given by Prof. Bengio at the end of his keynote, regarding negation and quantification, and how a Neural Network model deals with them: “I don’t know. But it learns to do what it needs to do.”
As a final side-note, the social event/dinner was a boat trip on the Thames: looking at some well-known London landmarks from a different point of view was absolutely amazing. Well done to the organisers!
This article is a brief introduction to Neo4j, one of the most popular graph databases, and its integration with Python.
Graph databases are a family of NoSQL databases, based on the concept of modelling your data as a graph, i.e. a collection of nodes (representing entities) and edges (representing relationships).
The motivation behind the use of a graph database is the need to model small records which are deeply interconnected, forming a complex web that is difficult to represent in a relational fashion. Graph databases are particularly good at supporting queries that actually make use of such connections, i.e. by traversing the graph. Examples of suitable applications include social networks, recommendation engines (e.g. “show me movies that my best friends like”) and many other cases of link-rich domains.
From the Neo4j website, we can download the community edition of Neo4j. At the time of writing, the latest version is 2.2.0, which provides improved performance and a redesigned UI. To install the software, simply unzip it:
tar zxf neo4j-community-2.2.0-unix.tar.gz
ln -s neo4j-community-2.2.0 neo4j
We can immediately run the server:

neo4j/bin/neo4j start
and now we can point the browser to http://localhost:7474 for a nice web GUI. The first time you open the interface, you’ll be asked to set a password for the user “neo4j”.
If you want to stop the server, you can type:

neo4j/bin/neo4j stop
Interfacing with Python
There is no shortage of Neo4j clients available for several programming languages, including Python. An interesting project, which makes use of the Neo4j REST interface, is Neo4jRestClient. Quick installation:
pip install neo4jrestclient
All the features of this client are listed in the docs.
Creating a sample graph
Let’s start with a simple social-network-like application, where users know each other and like different “things”. In this example, users and things will be nodes in our database. Each node can be associated with labels, used to describe the type of node. The following code will create two nodes labelled as User and two nodes labelled as Beer:
from neo4jrestclient.client import GraphDatabase
db = GraphDatabase("http://localhost:7474", username="neo4j", password="mypassword")
# Create some nodes with labels
user = db.labels.create("User")
u1 = db.nodes.create(name="Marco")
u2 = db.nodes.create(name="Daniela")
beer = db.labels.create("Beer")
b1 = db.nodes.create(name="Punk IPA")
b2 = db.nodes.create(name="Hoegaarden Rosee")
# You can associate a label with many nodes in one go
user.add(u1, u2)
beer.add(b1, b2)
The second step is all about connecting the dots, which in graph DB terminology means creating the relationships.
We notice that relationships have a direction, so we can easily model subject-predicate-object kinds of relationships. In case we need to model a bi-directional relationship, like a friend-of link in a social network, there are essentially two options:
Add two edges per relationship, one for each direction
Add one edge per relationship, with an arbitrary direction, and then ignore the direction in the query
In this example, we’re following the second option.
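As a sketch of how this looks with neo4jrestclient (reusing the db connection and the u1, u2, b1, b2 nodes created above; the relationship names knows and likes are our own choice), relationships can be created via node.relationships.create():

```python
# Sketch only: this requires a running Neo4j server and the nodes
# (u1, u2, b1, b2) created in the previous snippet.

# one "knows" edge with an arbitrary direction (option 2 above)
u1.relationships.create("knows", u2)

# "likes" edges are naturally directed: user -> beer
u1.relationships.create("likes", b1)
u1.relationships.create("likes", b2)
u2.relationships.create("likes", b2)
```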
Querying the graph
The Neo4j Browser, available at http://localhost:7474/, provides a nice way to query the DB and visualise the results, both as a list of records and in visual form.
The query language for Neo4j is called Cypher. It allows you to describe patterns in graphs in a declarative fashion, i.e. just like SQL, you describe what you want rather than how to retrieve it. Cypher uses a sort of ASCII art to describe nodes, relationships and their direction.
For example, we can retrieve our whole graph using the following Cypher query:
MATCH (n)-[r]->(m) RETURN n, r, m;
And the outcome in the browser:
In plain English, what the query is trying to match is “any node n, linked to a node m via a relationship r”. Suggestion: with a huge graph, use a LIMIT clause, e.g. MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 25;
Of course we can also embed Cypher in our Python app, for example:
from neo4jrestclient import client
q = 'MATCH (u:User)-[r:likes]->(m:Beer) WHERE u.name="Marco" RETURN u, type(r), m'
# "db" as defined above
results = db.query(q, returns=(client.Node, str, client.Node))
for r in results:
    # each result is a tuple matching the RETURN clause: (Node, str, Node)
    print("(%s)-[%s]->(%s)" % (r[0]["name"], r[1], r[2]["name"]))
# The output:
# (Marco)-[likes]->(Punk IPA)
# (Marco)-[likes]->(Hoegaarden Rosee)
The above query will retrieve all the triplets User-likes-Beer for the user Marco. The results variable will be a list of tuples, matching the format that we gave in Cypher with the RETURN keyword.
Graph databases, one of the NoSQL flavours, provide an interesting way to model data with rich interconnections. Examples of applications that are particularly suitable for graph databases are social networks and recommendation systems. This article has introduced Neo4j, one of the main examples of Graph DB, and its use with Python using the Neo4j REST client. We have seen how to create nodes and relationships, and how to query the graph using Cypher, the Neo4j query language.
A picture is worth a thousand tweets: more often than not, designing a good visual representation of our data can help us make sense of it and highlight interesting insights. After collecting and analysing Twitter data, the tutorial continues with some notions on data visualisation with Python.
but, at least in Python 3, the output is not well-formed HTML and you’d need to manually strip some characters.
With this procedure, we can plot many different types of charts with Vincent. Let’s take a moment to browse the docs and see its capabilities.
Time Series Visualisation
Another interesting aspect of analysing data from Twitter is the possibility to observe the distribution of tweets over time. In other words, if we organise the frequencies into temporal buckets, we could observe how Twitter users react to real-time events.
One of my favourite tools for data analysis with Python is Pandas, which also has a fairly decent support for time series. As an example, let’s track the hashtag #ITAvWAL to observe what happened during the first match.
Firstly, if we haven’t done it yet, we need to install Pandas:
pip install pandas
In the main loop which reads all the tweets, we simply track the occurrences of the hashtag, i.e. we can refactor the code from the previous episodes into something similar to:
import json

import pandas

dates_ITAvWAL = []
# f is the file pointer to the JSON data set
for line in f:
    tweet = json.loads(line)
    # let's focus on hashtags only at the moment (lower-cased for matching)
    terms_hash = [term.lower() for term in preprocess(tweet['text']) if term.startswith('#')]
    # track when the hashtag is mentioned
    if '#itavwal' in terms_hash:
        dates_ITAvWAL.append(tweet['created_at'])

# a list of "1" to count the hashtags
ones = [1] * len(dates_ITAvWAL)
# the index of the series
idx = pandas.DatetimeIndex(dates_ITAvWAL)
# the actual series (a series of 1s for the moment)
ITAvWAL = pandas.Series(ones, index=idx)
# Resampling / bucketing
per_minute = ITAvWAL.resample('1Min', how='sum').fillna(0)
The last line is what allows us to track the frequencies over time. The series is re-sampled with intervals of 1 minute. This means all the tweets falling within a particular minute will be aggregated, more precisely they will be summed up, given how='sum'. The time index will not keep track of the seconds anymore. If there is no tweet in a particular minute, the fillna() function will fill the blanks with zeros.
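As a toy sketch of what the resampling does (synthetic timestamps, not the real tweet data; note that in recent pandas versions the equivalent spelling is .resample('1min').sum() rather than resample('1Min', how='sum')):

```python
import pandas

# Three synthetic tweet timestamps: two fall within the same minute,
# and the 12:31 minute has no tweets at all
idx = pandas.DatetimeIndex([
    "2015-03-21 12:30:05",
    "2015-03-21 12:30:40",
    "2015-03-21 12:32:10",
])
series = pandas.Series([1, 1, 1], index=idx)

# Aggregate per minute; the empty minute becomes 0
per_minute = series.resample("1min").sum().fillna(0)

print(per_minute.tolist())  # [2, 0, 1]
```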
Once you embed the time_chart.json file into the HTML template discussed above, you’ll see this output:
The interesting moments of the match are observable from the spikes in the series. The first spike just before 1pm corresponds to the first Italian try. All the other spikes between 1:30 and 2:30pm correspond to Welsh tries and show the Welsh dominance during the second half. The match was over by 2:30, so after that Twitter went quiet.
Rather than just observing one series at a time, we could compare different series to observe how the matches evolved. So let’s refactor the code for the time series, keeping track of the three different hashtags #ITAvWAL, #SCOvIRE and #ENGvFRA in corresponding pandas.Series.
# all the data together
match_data = dict(ITAvWAL=per_minute_i, SCOvIRE=per_minute_s, ENGvFRA=per_minute_e)
# we need a DataFrame, to accommodate multiple series
all_matches = pandas.DataFrame(data=match_data)
# Resampling as above
all_matches = all_matches.resample('1Min', how='sum').fillna(0)
# and now the plotting
time_chart = vincent.Line(all_matches[['ITAvWAL', 'SCOvIRE', 'ENGvFRA']])
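The DataFrame constructor aligns the series on the union of their time indexes, inserting NaN wherever a series has no value for a given timestamp; a minimal sketch with toy data (not the real match series):

```python
import pandas

# Two toy series over partially overlapping time indexes
a = pandas.Series([1, 2], index=pandas.DatetimeIndex(
    ["2015-03-21 12:30", "2015-03-21 12:31"]))
b = pandas.Series([5], index=pandas.DatetimeIndex(
    ["2015-03-21 12:31"]))

# Indexes are aligned automatically; missing slots become NaN,
# which fillna(0) then turns into zeros
df = pandas.DataFrame({"A": a, "B": b}).fillna(0)

print(df["B"].tolist())  # [0.0, 5.0]
```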
And the output:
We can immediately observe when the different matches took place (approx. 12:30-2:30, 2:30-4:30 and 5-7) and we can see how the last match drew all the attention, especially towards the end when the winner was decided.
If this article has given you some ideas for data visualisation, please leave a comment below or get in touch.