Easy Text Analytics with the Dandelion API and Python

In the past few weeks, I’ve been playing around with some third-party Web APIs for Text Analytics, mainly for some side projects. This article is a short write-up of my experience with the Dandelion API.

Note: I’m not affiliated with dandelion.eu and I’m not a paying customer; I’m simply using their basic (i.e. free) plan, which is, at the moment, more than enough for my toy examples.

Quick Overview of the Dandelion API

The Dandelion API offers a set of endpoints for different text analytics tasks. In particular, it provides semantic analysis features for:

  • Entity Extraction
  • Text Similarity
  • Text Classification
  • Language Detection
  • Sentiment Analysis

As my attention was mainly on entity extraction and sentiment analysis, I’ll focus this article on the two related endpoints.

The basic (free) plan for Dandelion comes with a rate of 1,000 units/day (approx. 30,000 units/month). Different endpoints have different unit costs, e.g. entity extraction and sentiment analysis cost 1 unit per request, while text similarity costs 3 units per request. If you need to pass a URL or HTML instead of plain text, you’ll need to add an extra unit. The API is optimised for short text, so if you pass more than 4,000 characters, you’ll be billed extra units accordingly.
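To get a feeling for how requests translate into units, here is a rough estimator. The exact over-length billing is not spelled out above, so the one-extra-unit-per-additional-4,000-characters rule is my assumption, as is the helper itself (it’s not part of any official client):

import math

def estimate_units(text, endpoint_cost=1, is_url_or_html=False):
    # Base cost of the endpoint (1 for entity extraction/sentiment,
    # 3 for text similarity)
    units = endpoint_cost
    # Passing a URL or HTML instead of plain text adds one unit
    if is_url_or_html:
        units += 1
    # Assumption: one extra unit per additional 4,000-character block
    units += max(0, math.ceil(len(text) / 4000) - 1)
    return units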

Getting started

In order to test the Dandelion API, I’ve downloaded some tweets using the Twitter Stream API. You can have a look at a previous article to see how to get data from Twitter with Python.

As NASA recently found evidence of water on Mars, that’s one of the hot topics on social media at the moment, so let’s have a look at a couple of tweets:

  • So what you’re saying is we just found water on Mars…. But we can’t make an iPhone charger that won’t break after three weeks?
  • NASA found water on Mars while Chelsea fans are still struggling to find their team in the league table

(not trying to be funny with Apple/Chelsea fans here; I was trying to collect data to compare iPhone vs Android and some London football teams, but the water-on-Mars topic got all the attention).

The Dandelion API also provides a Python client, but the use of the API is so simple that we can directly use a library like requests to communicate with the endpoints. If it’s not installed yet, you can simply use pip:

pip install requests

Entity Extraction

Assuming you’ve signed up for the service, you will have an application key and an application ID. You will need them to query the service. The docs also provide all the references for the available parameters, the URI to query and the response format. App ID and key are passed via the parameters $app_id and $app_key respectively (mind the initial $ symbol).

import requests
import json

DANDELION_APP_ID = 'YOUR-APP-ID'
DANDELION_APP_KEY = 'YOUR-APP-KEY'

ENTITY_URL = 'https://api.dandelion.eu/datatxt/nex/v1'

def get_entities(text, confidence=0.1, lang='en'):
    payload = {
        '$app_id': DANDELION_APP_ID,
        '$app_key': DANDELION_APP_KEY,
        'text': text,
        'confidence': confidence,
        'lang': lang,
        'social.hashtag': True,
        'social.mention': True
    }
    response = requests.get(ENTITY_URL, params=payload)
    return response.json()

def print_entities(data):
    for annotation in data['annotations']:
        print("Entity found: %s" % annotation['spot'])

if __name__ == '__main__':
    query = "So what you're saying is we just found water on Mars.... But we can't make an iPhone charger that won't break after three weeks?"
    response = get_entities(query)
    print(json.dumps(response, indent=4))

This will produce the pretty-printed JSON response from the Dandelion API. In particular, let’s have a look at the annotations:

{
    "annotations": [
        {
            "label": "Water on Mars",
            "end": 51,
            "id": 21857752,
            "start": 38,
            "spot": "water on Mars",
            "uri": "http://en.wikipedia.org/wiki/Water_on_Mars",
            "title": "Water on Mars",
            "confidence": 0.8435
        },
        {
            "label": "IPhone",
            "end": 82,
            "id": 8841749,
            "start": 76,
            "spot": "iPhone",
            "uri": "http://en.wikipedia.org/wiki/IPhone",
            "title": "IPhone",
            "confidence": 0.799
        }
    ],
    /* more JSON output here */
}

Interesting to see that “water on Mars” is one of the entities (rather than just “water” and “Mars” as separate entities). Both entities are linked to their Wikipedia page, and both come with a high level of confidence. It would be even more interesting to see a different granularity for entity extraction, as in this case there is an explicit mention of one specific aspect of the iPhone (the battery charger).

The code snippet above also defines a print_entities() function, which you can use in place of the print statement if you want to print out only the entity references. Keep in mind that the spot attribute will contain the text as it appears in the original input. The other attributes of the output are pretty much self-explanatory, but you can check out the docs for further details.
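For example, to print only the entity references for the first tweet:

response = get_entities(query)
print_entities(response)
# Entity found: water on Mars
# Entity found: iPhone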

If we run the same code using the Chelsea-related tweet above, we can find the following entities:

{
    "annotations": [
        {
            "uri": "http://en.wikipedia.org/wiki/NASA",
            "title": "NASA",
            "spot": "NASA",
            "id": 18426568,
            "end": 4,
            "confidence": 0.8525,
            "start": 0,
            "label": "NASA"
        },
        {
            "uri": "http://en.wikipedia.org/wiki/Water_on_Mars",
            "title": "Water on Mars",
            "spot": "water on Mars",
            "id": 21857752,
            "end": 24,
            "confidence": 0.8844,
            "start": 11,
            "label": "Water on Mars"
        },
        {
            "uri": "http://en.wikipedia.org/wiki/Chelsea_F.C.",
            "title": "Chelsea F.C.",
            "spot": "Chelsea",
            "id": 7473,
            "end": 38,
            "confidence": 0.8007,
            "start": 31,
            "label": "Chelsea"
        }
    ],
    /* more JSON output here */
}

Overall, it looks quite interesting.

Sentiment Analysis

Sentiment Analysis is not an easy task, especially when performed on tweets (very little context, informal language, sarcasm, etc.).

Let’s try to use the Sentiment Analysis API with the same tweets:

import requests
import json

DANDELION_APP_ID = 'YOUR-APP-ID'
DANDELION_APP_KEY = 'YOUR-APP-KEY'

SENTIMENT_URL = 'https://api.dandelion.eu/datatxt/sent/v1'

def get_sentiment(text, lang='en'):
    payload = {
        '$app_id': DANDELION_APP_ID,
        '$app_key': DANDELION_APP_KEY,
        'text': text,
        'lang': lang
    }
    response = requests.get(SENTIMENT_URL, params=payload)
    return response.json()

if __name__ == '__main__':
    query = "So what you're saying is we just found water on Mars.... But we can't make an iPhone charger that won't break after three weeks?"
    response = get_sentiment(query)
    print(json.dumps(response, indent=4))

This will print the following output:

{
    "sentiment": {
        "score": -0.7,
        "type": "negative"
    },
    /* more JSON output here */
}

The “sentiment” attribute gives us a score (from -1, totally negative, to 1, totally positive) and a type, which is one of positive, negative, or neutral.
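If you only need these two fields, they are easy to pick out of the response; a minimal sketch reusing the get_sentiment() function above:

data = get_sentiment(query)
sentiment = data['sentiment']
print("%s (score: %s)" % (sentiment['type'], sentiment['score']))
# negative (score: -0.7)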

The main limitation here is that the object of the sentiment is not explicitly identified. Even if we cross-reference the entities extracted in the previous section, how can we programmatically link the negative sentiment to one of them? Is the negative sentiment related to finding water on Mars, or to the iPhone? As mentioned in the previous section, there is also an explicit mention of the battery charger, which is not captured by the API and which is the actual target of the sentiment in this example.

The Chelsea tweet above also produces a negative score. After downloading some more data in search of positive tweets, I found this one:

Nothing feels better than finishing a client job that you’re super happy with. Today is a good day.

The output for the Sentiment Analysis API:

{
    "sentiment": {
        "score": 0.7333333333333334,
        "type": "positive"
    },
    /* more JSON output here */
}

Well, this one was probably very explicit.

Summary

Using a third-party API can be as easy as writing a couple of lines in Python, or it can be a major pain. I think the short examples here show that the “easy” in the title is justified.

It’s worth noting that this article is not a proper review of the Dandelion API; it’s more like a short diary entry of my experiments, so what I’m reporting here is not a rigorous evaluation.

Anyway, the feeling is quite positive for the Entity Extraction API. I also ran some tests using hashtags with acronyms, and the API was able to correctly point me to the related entity. Occasionally some pieces of text are labelled as entities despite being completely out of scope. This happens mostly with movie (or song, or album) titles appearing verbatim in the text, probably labelled because of the little context available in Twitter’s 140 characters.

On the Sentiment Analysis side, I think providing only one aggregated score for the whole text sometimes doesn’t give the full picture. While it makes sense for some sentiment classification tasks (e.g. movie reviews, product reviews, etc.), we have seen more and more work on aspect-based sentiment analysis, which provides the right level of granularity to understand more deeply what users are saying. As I mentioned already, this is by no means trivial.

Overall, I had some fun playing with this API and I think the authors did a good job in keeping it simple to use.

Getting Started with MongoDB and Python

MongoDB is one of the popular NoSQL databases. It uses a document-oriented, JSON-like approach to represent data, making the integration of semi-structured data fairly easy.

This article is an introduction on how to use PyMongo, the package to interact with MongoDB in Python, for basic interactions with the database.

MongoDB Basics

As mentioned in the opening paragraph, MongoDB is a document-oriented database. The “document” is hence the main concept in this family of databases. This means that the database stores and retrieves records called documents.

If modelled in a sensible fashion, documents are usually self-describing and self-contained, so everything you need to know about the document is represented within the document. The data format used by MongoDB is JSON (stored internally as BSON, a binary representation of JSON), which is a very convenient way to encapsulate data.

An example of a document in MongoDB:

{
    "title": "An article about MongoDB and Python",
    "content": "Some long discussion about MongoDB",
    "publication_date": "2015-09-07 10:00:00",
    "shares": {
        "twitter": 123,
        "facebook" 456,
        "linkedin": 789
    },
    "tags": ["python", "mongodb", "nosql"],
    "author": {
        "name": "Marco",
        "author_id": 1
    }
}

Documents are grouped into collections. Compared with the relational world, if a document is a row/record, a collection is something similar to a table. A group of collections can be hosted on the same database, and the concept of a database here is similar to the relational one (sometimes referred to as a schema).

Being a JSON-style data store, MongoDB is often referred to as schemaless: fields can dynamically be added to documents without the need to define an explicit schema.

Note: schema design in complex applications is still a useful tool, and some careful thoughts should be spent when designing our document representation. Maybe it’s more helpful to think about it as dynamic schema rather than schemaless, as the word still conveys the idea of flexibility provided by document-oriented data stores, yet the importance of some sort of structure is not forgotten.
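For example, nothing prevents the two following documents (made up just for illustration) from living in the same collection, despite having different fields:

{
    "title": "First post",
    "tags": ["python"]
}

{
    "title": "Second post",
    "author": {"name": "Marco"},
    "likes": 42
}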

Quick Installation

MongoDB is available through most package managers (e.g. in rpm or deb format). You can also install it from the binary tarball by simply unpacking it and making it visible from your $PATH. From the official website, download the tarball for your operating system. At the time of this writing, the latest version is 3.0.6:

tar zxf mongodb-osx-x86_64-3.0.6.tgz
ln -s mongodb-osx-x86_64-3.0.6 mongodb

You can immediately run the server:

cd mongodb
./bin/mongod

The MongoDB daemon is now listening on localhost:27017, which is the default setting.

MongoDB + Python = PyMongo

PyMongo is the Python driver for MongoDB, and it can be installed via pip:

pip install pymongo

This should install the latest 3.0 version of PyMongo (the interface for previous 2.* versions could be slightly different, so pay attention to this detail).

The main component of PyMongo is the MongoClient class. In order to connect with the database, we’ll need an instance of the client:

from pymongo import MongoClient

client = MongoClient()
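With no arguments, MongoClient connects to localhost:27017, i.e. the default settings we saw when starting the server. Host and port can also be passed explicitly, and calling server_info() is a quick way to verify the connection, as it forces a round-trip to the server:

client = MongoClient('localhost', 27017)  # equivalent to MongoClient()
client.server_info()  # raises an exception if the server is unreachable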

Databases and collections are created lazily, as soon as they are first used, if they don’t exist. Let’s create a database called tutorial and a collection called articles:

db = client['tutorial']
coll = db['articles']

We can now define a document and insert it into the collection:

from datetime import datetime

doc = {
    "title": "An article about MongoDB and Python",
    "author": "Marco",
    "publication_date": datetime.utcnow(),
    # more fields
}

doc_id = coll.insert_one(doc).inserted_id

The insert_one() method does exactly what its name says. The document is represented as a Python dictionary, which looks very similar to a JSON object. PyMongo handles the conversion between data types, e.g. a datetime object in Python will become an ISODate in MongoDB, booleans are converted from True to true, etc.
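As a quick sketch to see the conversion in action, reusing the coll collection from above:

from datetime import datetime

coll.insert_one({
    "title": "Type conversion demo",
    "published": True,               # stored as true in MongoDB
    "created_at": datetime.utcnow()  # stored as an ISODate in MongoDB
})
doc = coll.find_one({"title": "Type conversion demo"})
print(type(doc['created_at']))
# <class 'datetime.datetime'>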

We also access the inserted_id attribute of the result, to capture the ID assigned by MongoDB to the document. This sort of primary key is stored in the _id field:

print(repr(doc_id))  # print(doc_id) alone would show just the hex string
# ObjectId('55eb163d3661250ae4232ba6')

The _id of a document is an instance of ObjectId, rather than a simple string. This is important to keep in mind if we try to retrieve a document by its ID:

# query by ObjectId
my_doc = coll.find_one({'_id': doc_id})
print(my_doc)
# {'title': 'An article about MongoDB and Python', 'author': 'Marco', '_id': ObjectId('55eb163d3661250ae4232ba6'),  ...}

doc_id_str = str(doc_id)
print(doc_id_str)
# 55eb163d3661250ae4232ba6

# query by ID-string
my_doc = coll.find_one({'_id': doc_id_str})
print(my_doc)
# None (no document matches the plain string)

# Converting an ID-string to an ObjectId
from bson.objectid import ObjectId
my_doc = coll.find_one({'_id': ObjectId(doc_id_str)})
print(my_doc)
# {'title': 'An article about MongoDB and Python', 'author': 'Marco', '_id': ObjectId('55eb163d3661250ae4232ba6'), ...}

The string-vs-ObjectId mismatch problem is one of the first encountered by MongoDB novices, so it’s something to keep in mind.
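To avoid the pitfall, a tiny helper can accept either representation (this is just a convenience sketch, not part of PyMongo):

from bson.objectid import ObjectId

def find_by_id(collection, doc_id):
    # Accept either an ObjectId or its string representation
    if not isinstance(doc_id, ObjectId):
        doc_id = ObjectId(doc_id)
    return collection.find_one({'_id': doc_id})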

We have introduced the insert_one() and find_one() functions, which work on a single document. Their counterparts, insert_many() and find(), work on many documents. Specifically, insert_many() takes a list of documents (i.e. a list of dictionaries) as its argument. On the other hand, find() returns a cursor, which can be used as an iterable, e.g.:

many_docs = coll.find() # empty query means "retrieve all"
for doc in many_docs:
    print(doc)

More Complex Queries

Given this data set, inserted into an empty collection (e.g. after a coll.delete_many({})) so the documents created earlier don’t interfere with the results below:

doc1 = {
    "title": "Intro to MongoDB and Python",
    "publication_date": datetime(2015, 9, 7),
    "likes": 10
}
doc2 = {
    "title": "Intro to Neo4J and Python",
    "publication_date": datetime(2015, 9, 1),
    "likes": 5
}
doc3 = {
    "title": "Intro to Elasticsearch and Python",
    "publication_date": datetime(2015, 8, 1),
    "likes": 15
}
coll.insert_many([doc1, doc2, doc3])

print(coll.count())
# 3

We can filter the results to find documents older than a given date:

results = coll.find({"publication_date": {'$lt': datetime(2015, 9, 1)}})
for doc in results:
    print(doc)
# {'title': 'Intro to Elasticsearch and Python', ...}

The $lt operator simply stands for less than, and of course it finds its counterpart in $gt. As expected, we also have $lte and $gte if we want to include the boundary as well (the date 2015-09-01 in the previous example):

results = coll.find({"publication_date": {'$lte': datetime(2015, 9, 1)}})
for doc in results:
    print(doc)
# {'title': 'Intro to Neo4J and Python', ...}
# {'title': 'Intro to Elasticsearch and Python', ...}

The results are returned in _id order. If we need to sort the results according to a particular field, we can use the sort() function:

from pymongo import ASCENDING, DESCENDING

# get all docs, sort by number of likes high-to-low
results = coll.find().sort("likes", DESCENDING)
for doc in results:
    print(doc)
# {'title': 'Intro to Elasticsearch and Python', "likes": 15, ...}
# {'title': 'Intro to MongoDB and Python', "likes": 10, ...}
# {'title': 'Intro to Neo4J and Python', "likes": 5, ...}
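sort() also accepts a list of (field, direction) pairs, in case we need a secondary sort key, e.g.:

# sort by likes (high-to-low), breaking ties by publication date (oldest first)
results = coll.find().sort([("likes", DESCENDING), ("publication_date", ASCENDING)])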

We can of course build more complex queries and combine them with the appropriate sorting, as shown right after the example below. As the query itself is a Python dictionary, we can define it separately rather than inline, just for readability:

query = {
    "publication_date": {
        "$gte": datetime(2015, 9, 1)
    },
    "likes": {
        "$gt": 5
    }
}
results = coll.find(query)
for doc in results:
    print(doc)
# {'title': 'Intro to MongoDB and Python', "likes": 10, ...}
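Combining the filter with a sort is then just a matter of chaining, e.g. to get the matching documents ordered by popularity:

results = coll.find(query).sort("likes", DESCENDING)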

Summary

This article has introduced PyMongo, the Python driver to interact with MongoDB. The interaction with MongoDB via Python is fairly straightforward, and we can be up and running with some basic queries quite quickly.

MongoDB is one of the popular NoSQL databases which uses a document-oriented data store. Its JSON-like format and its dynamic schema approach make the case for self-describing and self-contained documents.