Text Mining – Marco Bonzanini

Great news, my book on data mining for social media is finally out!

The title is Mastering Social Media Mining with Python. I’ve been working with Packt Publishing over the past few months, and in July the book has been finalised and released.

Links:

ebook and paperback on Packt Publishing (the publisher)
ebook and paperback on Amazon.com and Amazon UK
Companion code for the book on my GitHub
Author’s profile on goodreads (inc. book ratings/reviews)

As part of Packt’s Mastering series, the book assumes the readers already have some basic understanding of Python (e.g. for loops and classes), but more advanced concepts are discussed with examples. No particular experience with Social Media APIs and Data Mining is required. With 300+ pages, by the end of the book, the readers should be able to build their own data mining projects using data from social media and Python tools.

A bird’s eye view on the content:

Social Media, Social Data and Python
- Introduction on Social Media and Social Data: challenges and opportunities
- Introduction on Python tools for Data Science
- Overview on the use of public APIs to interact with social media platforms
#MiningTwitter: Hashtags, Topics and Time Series
- Interacting with the Twitter API in Python
- Twitter data: the anatomy of a tweet
- Entity analysis, text analysis, time series analysis on tweets
Users, Followers, and Communities on Twitter
- Analysing who follows whom
- Mining your followers
- Mining communities
- Visualising tweets on a map
Posts, Pages and User Interactions on Facebook
- Interacting the Facebook Graph API in Python
- Mining you posts
- Mining Facebook Pages
Topic analysis on Google Plus
- Interacting with the Google Plus API in Python
- Finding people and pages on G+
- Analysis of notes and activities on G+
Questions and Answers on Stack Exchange
- Interacting with the StackOverflow API in Python
- Text classification for question tags
Blogs, RSS, Wikipedia, and Natural Language Processing
- Blogs and web pages as social data Web scraping with Python
- Basics of text analytics on blog posts
- Information extraction from text
Mining All the Data!
- Interacting with many other APIs and types of objects
- Examples of interaction with YouTube, Yelp and GitHub
Linked Data and the Semantic Web
- The Web as Social Media
- Mining relations from DBpedia
- Mining geo coordinates

The detailed table of contents is shown on the Packt Pub’s page. Chapter 2 is also offered as free sample.

Please have a look at the companion code for the book on my GitHub, so you can have an idea of the applications discussed in the book.

In this article, I’ll discuss some aspects of text summarisation, the process of analysing a text document, or a set of documents, in order to produce a summary of its content. The overall purpose is to reduce the amount of information that a user has to digest in order to understand whether reading the whole document is relevant for its information need.

This article is a bird’s-eye view on the topic, to understand the different implications of the problem, rather than a detailed discussion on a specific implementation. The latter will be the subject of future articles.

Summarisation is one of the important tasks in text analytics and it’s an active area of (academic) research which involves mainly the Natural Language Processing and Information Retrieval communities.

Information Overload and the Need for a Good Summary

The core of the matter is the information overload we are experiencing on a daily basis. To put it simply, there is just too much information to digest, and not enough time to do it. The purpose of summarisation is to minimise the amount of information you have to go through, before you can grasp the overall concepts described in the document.

Summarisation can happen in different forms, but the key idea is to present the user with something short, yet informative.

To name just one example, let’s say you want to buy some product, and you’d like to get some opinions about such product. None of your friends owns the product, so you have a look at the on-line reviews: thousands and thousands of sentences to read. Are you going to read all of them? Do you read just some of them and hope to get the best insights?

This is how Google Shopping provides the user with a possible solution:

Example of Google Shopping Summary — Example of Summary from Google Shopping

In this image, the reviews about a popular gaming console are condensed, providing a distribution of ratings and a breakdown of different aspects about the product (e.g. picture/video or battery). The user can then decide to read further, by clicking on a specific aspect, or on a specific rating. Other popular on-line services offer similar

Maybe this is not a big issue when the value of the product is just a few pounds/dollas/euros, but the same problem will arise any time there is just too much to read, and not enough time.

Application Scenarios

As mentioned in the previous paragraph, every scenario where there is a lot of text to read can be a good application scenario for text summarisation. The following list is paraphrased from a tutorial given at ACL 2001 by Maybury and Many:

News summaries: what happened while I was away?
Physicians’ aids: summarise and compare the recommended treatments for this patient
Meeting summarisation: what happended at the meeting I missed?
Search engine result pages: snippets of the retrieved documents compared to the query
Intelligence: create 500-word file of a suspect
Small screens: create a screen-sized summary of a book/email
Aids for visually impaired: compact the text and read it out for a blind person

More examples:

Sentiment Analysis: give me pros and cons of a product
Social media: what are the trending topics today?

Text summarisation is not the only way to tackle the information overload in some of these scenarios, but it can play an important role and it can be used as a component of a more complex system that involves e.g. recommendations and search.

Properties of a Summary

Before we can build a summarisation system, we need to understand how the summary is going to be consumed.

There are many different ways to characterise a summary, here we summarise some of them.

Abstract vs. extract: do we rephrase the content or do we extract some of it? The first involves natural language generation, the latter involves e.g. phrase/sentence ranking.

Single vs Multi source: multiple sources can introduce discrepancies, confirming or contradicting some information. Reviews stating opposite opinions and experiences can be legitimate. News releases that contradict each others are problematic. How do we deal with duplicate content? How do we promote novelty?

Generic vs User-oriented: a generic summary is static, created once for all the users. A user-oriented summary is dynamic, tailored to a particular user profile or user session.

Summary Function: do we want to cover all the key points of the source, or just act as a preview? Do we provide an additional critical view on the source? Think about a movie, and compare its plot, its trailer and a review about it: they are all summaries of the movie, but they have different functions.

Summary Length: a summary should be… short. How short? Do we have a target length (number of words/sentences) or a compression rate (e.g. 5% of the source)?

Linguistic qualities: is the summary coherent? Is it fluent? Is it self-contained?

These are just some of the aspects to consider when building a summarisation application. The key question probably is: how is it going to help the user?

Evaluation: How Good is a Summary?

Evaluating a summary is a challenge in itself. The previous paragraph has opened the discussion for a variety of summarisation approaches, so in order to decide how good a summary is, we really need to put some more context.

In principle, there two orthogonal ways to evaluate summarisation: user-vs-system-based and intrinsic-vs-extrinsic. Let’s briefly discuss them.

User-based intrinsic: users are asked to judge the quality of the summary per se. A typical question could be as simple as “how did you like the summary?“, or something more complex regarding the coherence of the summary, or whether it was helpful to understand the full text.

User-based extrinsic: users are asked to complete a particular task. The quality of the summary is measured against how well the user performs on the task. Here, “how well” can involve, for example, accuracy or speed: does the summary improve the user’s performance?

System-based intrinsic: gold standard summaries are produced by human judges, and the system-generated summaries are compared against them. Some evaluation metrics are involved for the automatic generation of a score that allows summarisation systems to be compared. A common example is the ROUGE framework, based on n-gram overlaps.

System-based extrinsic: the system performs some other tasks (e.g. text classification), using the system-generated summary. The system performances are evaluated for these other tasks, with and without the use of the summarisation component.

In general, involving users is a longer and more expensive process, but can provide interesting insights in terms of summary quality. System-based summarisation with e.g. ROUGE can be useful for some initial comparison, when many potential system candidates are available and employing users to judge all of them could be simply not feasible.

Evaluating a summariser against a particular task (extrinsic evaluation) often helps to answer the initial question, how is the summary going to help the user?

Conclusions

To summarise :) there are a few aspects to consider before building a summarisation system.

This article has provided an overall introduction to the field, to highlight some of the key issues to think about.

Some follow-up article will provide more concrete examples with existing tools and actual implementation, to showcase the real use of text summarisation.

Category: Text Mining

A Brief Introduction to Text Summarisation