Mastering Social Media Mining with Python


Great news, my book on data mining for social media is finally out!

The title is Mastering Social Media Mining with Python. I’ve been working with Packt Publishing over the past few months, and the book was finalised and released in July.

As part of Packt’s Mastering series, the book assumes readers already have a basic understanding of Python (e.g. for loops and classes), but more advanced concepts are discussed with examples. No particular experience with social media APIs or data mining is required. At 300+ pages, the book should leave readers able to build their own data mining projects using social media data and Python tools.

A bird’s-eye view of the content:

  1. Social Media, Social Data and Python
    • Introduction to Social Media and Social Data: challenges and opportunities
    • Introduction to Python tools for Data Science
    • Overview of the use of public APIs to interact with social media platforms
  2. #MiningTwitter: Hashtags, Topics and Time Series
    • Interacting with the Twitter API in Python
    • Twitter data: the anatomy of a tweet
    • Entity analysis, text analysis, time series analysis on tweets
  3. Users, Followers, and Communities on Twitter
    • Analysing who follows whom
    • Mining your followers
    • Mining communities
    • Visualising tweets on a map
  4. Posts, Pages and User Interactions on Facebook
    • Interacting with the Facebook Graph API in Python
    • Mining your posts
    • Mining Facebook Pages
  5. Topic analysis on Google Plus
    • Interacting with the Google Plus API in Python
    • Finding people and pages on G+
    • Analysis of notes and activities on G+
  6. Questions and Answers on Stack Exchange
    • Interacting with the StackOverflow API in Python
    • Text classification for question tags
  7. Blogs, RSS, Wikipedia, and Natural Language Processing
    • Blogs and web pages as social data
    • Web scraping with Python
    • Basics of text analytics on blog posts
    • Information extraction from text
  8. Mining All the Data!
    • Interacting with many other APIs and types of objects
    • Examples of interaction with YouTube, Yelp and GitHub
  9. Linked Data and the Semantic Web
    • The Web as Social Media
    • Mining relations from DBpedia
    • Mining geo coordinates

The detailed table of contents is available on Packt’s page, and Chapter 2 is also offered as a free sample.

Have a look at the companion code for the book on my GitHub to get an idea of the applications discussed in the book.

Easy Text Analytics with the Dandelion API and Python

In the past few weeks, I’ve been playing around with some third-party Web APIs for Text Analytics, mainly for some side projects. This article is a short write-up of my experience with the Dandelion API.

Note: I’m not affiliated with dandelion.eu and I’m not a paying customer; I’m simply using their basic (i.e. free) plan, which is, at the moment, more than enough for my toy examples.

Quick Overview of the Dandelion API

The Dandelion API has a set of endpoints for different text analytics tasks. In particular, it offers semantic analysis features for:

  • Entity Extraction
  • Text Similarity
  • Text Classification
  • Language Detection
  • Sentiment Analysis

As my attention was mainly on entity extraction and sentiment analysis, I’ll focus this article on those two endpoints.

The basic (free) plan for Dandelion comes with a rate limit of 1,000 units/day (approx. 30,000 units/month). Different endpoints have different unit costs: entity extraction and sentiment analysis cost 1 unit per request, while text similarity costs 3 units per request. If you pass a URL or HTML instead of plain text, you’ll be billed one extra unit. The API is also optimised for short text, so if you pass more than 4,000 characters, you’ll be billed extra units accordingly.
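
For a back-of-the-envelope check of your daily budget, here is a tiny helper of my own (not part of the Dandelion API; in particular, the billing granularity for long texts is my assumption) that estimates the unit cost of a request under the plan described above:

def estimate_units(endpoint, text=None, is_url_or_html=False):
    # Base cost per request, as per the pricing described above
    base_cost = {'entities': 1, 'sentiment': 1, 'similarity': 3}
    units = base_cost[endpoint]
    if is_url_or_html:
        units += 1  # passing a URL or HTML costs one extra unit
    if text:
        # Assumption: one extra unit per additional 4,000-character block
        units += (len(text) - 1) // 4000
    return units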

Getting started

To test the Dandelion API, I downloaded some tweets using the Twitter Streaming API. You can have a look at a previous article to see how to get data from Twitter with Python; a minimal sketch is also shown below.
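
As a quick reminder, this is roughly what the collection looks like with tweepy (a sketch assuming the tweepy 3.x streaming interface; the credential placeholders and the tracked keyword are illustrative, and the linked article has the full walkthrough):

import tweepy

# Fill in with your own credentials from apps.twitter.com
auth = tweepy.OAuthHandler('YOUR-CONSUMER-KEY', 'YOUR-CONSUMER-SECRET')
auth.set_access_token('YOUR-ACCESS-TOKEN', 'YOUR-ACCESS-SECRET')

class TweetCollector(tweepy.StreamListener):
    def on_status(self, status):
        print(status.text)  # or append to a file for later analysis

stream = tweepy.Stream(auth, TweetCollector())
stream.filter(track=['water on mars'])  # illustrative keyword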

NASA recently found evidence of water on Mars, which is one of the hot topics on social media at the moment, so let’s have a look at a couple of tweets:

  • So what you’re saying is we just found water on Mars…. But we can’t make an iPhone charger that won’t break after three weeks?
  • NASA found water on Mars while Chelsea fans are still struggling to find their team in the league table

(not trying to be funny with Apple/Chelsea fans here; I was trying to collect data to compare iPhone vs Android and some London football teams, but the water-on-Mars topic got all the attention).

The Dandelion API also provides a Python client, but using the API is so simple that we can communicate with the endpoints directly via a library like requests. If it’s not installed yet, you can simply use pip:

pip install requests

Entity Extraction

Assuming you’ve signed up for the service, you will have an application key and an application ID. You will need them to query the service. The docs also provide all the references for the available parameters, the URI to query and the response format. App ID and key are passed via the parameters $app_id and $app_key respectively (mind the initial $ symbol).

import requests
import json

DANDELION_APP_ID = 'YOUR-APP-ID'
DANDELION_APP_KEY = 'YOUR-APP-KEY'

ENTITY_URL = 'https://api.dandelion.eu/datatxt/nex/v1'

def get_entities(text, confidence=0.1, lang='en'):
    # Build the query: credentials, input text, minimum confidence
    # threshold, and flags to also analyse hashtags and user mentions
    payload = {
        '$app_id': DANDELION_APP_ID,
        '$app_key': DANDELION_APP_KEY,
        'text': text,
        'confidence': confidence,
        'lang': lang,
        'social.hashtag': True,
        'social.mention': True
    }
    response = requests.get(ENTITY_URL, params=payload)
    return response.json()

def print_entities(data):
    for annotation in data['annotations']:
        print("Entity found: %s" % annotation['spot'])

if __name__ == '__main__':
    query = "So what you're saying is we just found water on Mars.... But we can't make an iPhone charger that won't break after three weeks?"
    response = get_entities(query)
    print(json.dumps(response, indent=4))

This will produce the pretty-printed JSON response from the Dandelion API. In particular, let’s have a look at the annotations:

{
    "annotations": [
        {
            "label": "Water on Mars",
            "end": 51,
            "id": 21857752,
            "start": 38,
            "spot": "water on Mars",
            "uri": "http://en.wikipedia.org/wiki/Water_on_Mars",
            "title": "Water on Mars",
            "confidence": 0.8435
        },
        {
            "label": "IPhone",
            "end": 82,
            "id": 8841749,
            "start": 76,
            "spot": "iPhone",
            "uri": "http://en.wikipedia.org/wiki/IPhone",
            "title": "IPhone",
            "confidence": 0.799
        }
    ],
    /* more JSON output here */
}

It’s interesting to see that “water on Mars” is extracted as a single entity (rather than “water” and “Mars” as separate entities). Both entities are linked to their Wikipedia pages, and both come with a high confidence score. It would be even more interesting to see a finer granularity for entity extraction, as in this case there is an explicit mention of one specific aspect of the iPhone (the battery charger).

The code snippet above also defines a print_entities() function, which you can use in place of the print statement if you only want the entity references. Keep in mind that the spot attribute contains the text as it appears in the original input. The other attributes of the output are pretty much self-explanatory, but you can check the docs for further details.
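
For example, replacing the json.dumps() print with print_entities(response) for the iPhone tweet above would simply output:

Entity found: water on Mars
Entity found: iPhone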

If we run the same code using the Chelsea-related tweet above, we can find the following entities:

{
    "annotations": [
        {
            "uri": "http://en.wikipedia.org/wiki/NASA",
            "title": "NASA",
            "spot": "NASA",
            "id": 18426568,
            "end": 4,
            "confidence": 0.8525,
            "start": 0,
            "label": "NASA"
        },
        {
            "uri": "http://en.wikipedia.org/wiki/Water_on_Mars",
            "title": "Water on Mars",
            "spot": "water on Mars",
            "id": 21857752,
            "end": 24,
            "confidence": 0.8844,
            "start": 11,
            "label": "Water on Mars"
        },
        {
            "uri": "http://en.wikipedia.org/wiki/Chelsea_F.C.",
            "title": "Chelsea F.C.",
            "spot": "Chelsea",
            "id": 7473,
            "end": 38,
            "confidence": 0.8007,
            "start": 31,
            "label": "Chelsea"
        }
    ],
    /* more JSON output here */
}

Overall, it looks quite interesting.

Sentiment Analysis

Sentiment Analysis is not an easy task, especially when performed on tweets (very little context, informal language, sarcasm, etc.).

Let’s try to use the Sentiment Analysis API with the same tweets:

import requests
import json

DANDELION_APP_ID = 'YOUR-APP-ID'
DANDELION_APP_KEY = 'YOUR-APP-KEY'

SENTIMENT_URL = 'https://api.dandelion.eu/datatxt/sent/v1'

def get_sentiment(text, lang='en'):
    payload = {
        '$app_id': DANDELION_APP_ID,
        '$app_key': DANDELION_APP_KEY,
        'text': text,
        'lang': lang
    }
    response = requests.get(SENTIMENT_URL, params=payload)
    return response.json()

if __name__ == '__main__':
    query = "So what you're saying is we just found water on Mars.... But we can't make an iPhone charger that won't break after three weeks?"
    response = get_sentiment(query)
    print(json.dumps(response, indent=4))

This will print the following output:

{
    "sentiment": {
        "score": -0.7,
        "type": "negative"
    },
    /* more JSON output here */
}

The “sentiment” attribute gives us a score (from -1, totally negative, to 1, totally positive) and a type, which is one of positive, negative, or neutral.

The main limitation here is that the object of the sentiment is not explicitly identified. Even if we cross-reference the entities extracted in the previous section, how can we programmatically link the negative sentiment to one of them? Is the negative sentiment about finding water on Mars, or about the iPhone? As mentioned above, there is also an explicit mention of the battery charger, which is not captured by the API and which is the actual target of the sentiment in this example.
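
To make the point concrete, here is a small sketch (assuming the get_entities() and get_sentiment() helpers defined above are available in the same module) that prints the extracted entities next to the single overall sentiment score, which is all the API gives us to work with:

def entities_with_sentiment(text):
    # One list of entities, one overall score: there is no
    # programmatic link between the sentiment and any single entity
    entities = [a['spot'] for a in get_entities(text).get('annotations', [])]
    sentiment = get_sentiment(text)['sentiment']
    print("Entities: %s" % ', '.join(entities))
    print("Sentiment: %s (%s)" % (sentiment['type'], sentiment['score']))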

The Chelsea tweet above also produces a negative score. After downloading some more data in search of positive tweets, I found this one:

Nothing feels better than finishing a client job that you’re super happy with. Today is a good day.

The output for the Sentiment Analysis API:

{
    "sentiment": {
        "score": 0.7333333333333334,
        "type": "positive"
    },
    /* more JSON output here */
}

Well, this one was probably very explicit.

Summary

Using a third-party API can be as easy as writing a couple of lines in Python, or it can be a major pain. I think the short examples here show that the “easy” in the title is well deserved.

It’s worth noting that this article is not a proper review of the Dandelion API; it’s more of a short diary entry about my experiments, so what I’m reporting here is not a rigorous evaluation.

Anyway, my feeling about the Entity Extraction API is quite positive. I also ran some tests using hashtags containing acronyms, and the API was able to correctly point me to the related entity. Occasionally some pieces of text are labelled as entities despite being completely out of scope. This happens mostly with movie (or song, or album) titles appearing verbatim in the text, probably labelled because of the little context available in Twitter’s 140 characters.

On the Sentiment Analysis side, I think providing only one aggregated score for the whole text sometimes doesn’t give the full picture. While this makes sense for some sentiment classification tasks (e.g. movie reviews, product reviews, etc.), we have seen more and more work on aspect-based sentiment analysis, which provides the right level of granularity to understand more deeply what users are saying. As I mentioned already, though, this is not a trivial task.

Overall, I had some fun playing with this API and I think the authors did a good job in keeping it simple to use.

A Brief Introduction to Text Summarisation

In this article, I’ll discuss some aspects of text summarisation, the process of analysing a text document, or a set of documents, in order to produce a summary of its content. The overall purpose is to reduce the amount of information a user has to digest in order to understand whether reading the whole document is relevant to their information need.

This article is a bird’s-eye view of the topic, aimed at understanding the different implications of the problem rather than discussing a specific implementation in detail. The latter will be the subject of future articles.

Summarisation is one of the important tasks in text analytics and an active area of (academic) research, involving mainly the Natural Language Processing and Information Retrieval communities.

Information Overload and the Need for a Good Summary

The core of the matter is the information overload we are experiencing on a daily basis. To put it simply, there is just too much information to digest, and not enough time to do it. The purpose of summarisation is to minimise the amount of information you have to go through, before you can grasp the overall concepts described in the document.

Summarisation can happen in different forms, but the key idea is to present the user with something short, yet informative.

To name just one example, let’s say you want to buy a product, and you’d like to get some opinions about it. None of your friends owns the product, so you have a look at the on-line reviews: thousands and thousands of sentences to read. Are you going to read all of them? Do you read just some of them and hope to get the best insights?

This is how Google Shopping provides the user with a possible solution:

Example of a summary from Google Shopping

In this image, the reviews about a popular gaming console are condensed into a distribution of ratings and a breakdown of different aspects of the product (e.g. picture/video or battery). The user can then decide to read further by clicking on a specific aspect or on a specific rating. Other popular on-line services offer similar summaries.

Maybe this is not a big issue when the product is only worth a few pounds/dollars/euros, but the same problem arises any time there is just too much to read and not enough time.

Application Scenarios

As mentioned in the previous paragraph, every scenario with a lot of text to read can be a good application scenario for text summarisation. The following list is paraphrased from a tutorial given at ACL 2001 by Maybury and Mani:

  • News summaries: what happened while I was away?
  • Physicians’ aids: summarise and compare the recommended treatments for this patient
  • Meeting summarisation: what happened at the meeting I missed?
  • Search engine result pages: snippets of the retrieved documents compared to the query
  • Intelligence: create a 500-word file on a suspect
  • Small screens: create a screen-sized summary of a book/email
  • Aids for the visually impaired: compact the text and read it out for a blind person

More examples:

  • Sentiment Analysis: give me pros and cons of a product
  • Social media: what are the trending topics today?

Text summarisation is not the only way to tackle information overload in some of these scenarios, but it can play an important role, and it can be used as a component of a more complex system involving e.g. recommendations and search.

Properties of a Summary

Before we can build a summarisation system, we need to understand how the summary is going to be consumed.

There are many different ways to characterise a summary; here we summarise some of them.

Abstract vs. extract: do we rephrase the content or do we extract some of it? The former involves natural language generation; the latter involves e.g. phrase/sentence ranking.
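
As a taste of the extractive approach, here is a deliberately naive sentence-ranking sketch of my own, scoring sentences by the frequency of the words they contain (real systems are of course far more sophisticated):

import re
from collections import Counter

def naive_extractive_summary(text, n_sentences=2):
    # Crude sentence and word splitting, for illustration only
    sentences = re.split(r'(?<=[.!?])\s+', text)
    freq = Counter(re.findall(r'\w+', text.lower()))
    # Score each sentence by the total frequency of its words
    def score(sentence):
        return sum(freq[w] for w in re.findall(r'\w+', sentence.lower()))
    top = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Return the top-ranked sentences in their original order
    return ' '.join(s for s in sentences if s in top)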

Single vs multi source: multiple sources can introduce discrepancies, confirming or contradicting some information. Reviews stating opposite opinions and experiences can be legitimate; news releases that contradict each other are problematic. How do we deal with duplicate content? How do we promote novelty?

Generic vs User-oriented: a generic summary is static, created once for all the users. A user-oriented summary is dynamic, tailored to a particular user profile or user session.

Summary Function: do we want to cover all the key points of the source, or just act as a preview? Do we provide an additional critical view on the source? Think about a movie, and compare its plot, its trailer and a review about it: they are all summaries of the movie, but they have different functions.

Summary Length: a summary should be… short. How short? Do we have a target length (number of words/sentences) or a compression rate (e.g. 5% of the source)?

Linguistic qualities: is the summary coherent? Is it fluent? Is it self-contained?

These are just some of the aspects to consider when building a summarisation application. The key question probably is: how is it going to help the user?

Evaluation: How Good is a Summary?

Evaluating a summary is a challenge in itself. The previous paragraph has opened the discussion to a variety of summarisation approaches, so in order to decide how good a summary is, we really need some more context.

In principle, there are two orthogonal ways to evaluate summarisation: user-based vs. system-based, and intrinsic vs. extrinsic. Let’s briefly discuss them.

User-based intrinsic: users are asked to judge the quality of the summary per se. A typical question could be as simple as “how did you like the summary?”, or something more complex regarding the coherence of the summary, or whether it was helpful for understanding the full text.

User-based extrinsic: users are asked to complete a particular task. The quality of the summary is measured against how well the user performs on the task. Here, “how well” can involve, for example, accuracy or speed: does the summary improve the user’s performance?

System-based intrinsic: gold-standard summaries are produced by human judges, and the system-generated summaries are compared against them. Evaluation metrics are used to automatically produce a score that allows summarisation systems to be compared. A common example is the ROUGE framework, based on n-gram overlaps.
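
As an illustration, ROUGE-1 recall boils down to counting the overlapping unigrams between the system summary and the reference; a simplified sketch follows (the actual ROUGE toolkit also handles stemming, multiple references, longer n-grams, etc.):

from collections import Counter

def rouge_1_recall(system_summary, reference_summary):
    # Fraction of the reference's unigrams that also appear in the
    # system summary, with clipped counts as in standard ROUGE-N
    system = Counter(system_summary.lower().split())
    reference = Counter(reference_summary.lower().split())
    overlap = sum(min(count, system[word])
                  for word, count in reference.items())
    return overlap / sum(reference.values())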

System-based extrinsic: the system performs some other task (e.g. text classification) using the system-generated summary. The system’s performance on that task is then evaluated with and without the summarisation component.

In general, involving users is a longer and more expensive process, but it can provide interesting insights in terms of summary quality. System-based evaluation with e.g. ROUGE can be useful for initial comparisons, when many potential candidate systems are available and employing users to judge all of them would simply not be feasible.

Evaluating a summariser against a particular task (extrinsic evaluation) often helps to answer the initial question: how is the summary going to help the user?

Conclusions

To summarise :) there are a few aspects to consider before building a summarisation system.

This article has provided an overall introduction to the field, to highlight some of the key issues to think about.

Follow-up articles will provide more concrete examples with existing tools and actual implementations, to showcase real uses of text summarisation.