August 2015 – Marco Bonzanini

A Brief Introduction to Text Summarisation

In this article, I’ll discuss some aspects of text summarisation, the process of analysing a text document, or a set of documents, in order to produce a summary of its content. The overall purpose is to reduce the amount of information that a user has to digest in order to understand whether reading the whole document is relevant for its information need.

This article is a bird’s-eye view on the topic, to understand the different implications of the problem, rather than a detailed discussion on a specific implementation. The latter will be the subject of future articles.

Summarisation is one of the important tasks in text analytics and it’s an active area of (academic) research which involves mainly the Natural Language Processing and Information Retrieval communities.

Information Overload and the Need for a Good Summary

The core of the matter is the information overload we are experiencing on a daily basis. To put it simply, there is just too much information to digest, and not enough time to do it. The purpose of summarisation is to minimise the amount of information you have to go through, before you can grasp the overall concepts described in the document.

Summarisation can happen in different forms, but the key idea is to present the user with something short, yet informative.

To name just one example, let’s say you want to buy some product, and you’d like to get some opinions about such product. None of your friends owns the product, so you have a look at the on-line reviews: thousands and thousands of sentences to read. Are you going to read all of them? Do you read just some of them and hope to get the best insights?

This is how Google Shopping provides the user with a possible solution:

Example of Google Shopping Summary — Example of Summary from Google Shopping

In this image, the reviews about a popular gaming console are condensed, providing a distribution of ratings and a breakdown of different aspects about the product (e.g. picture/video or battery). The user can then decide to read further, by clicking on a specific aspect, or on a specific rating. Other popular on-line services offer similar

Maybe this is not a big issue when the value of the product is just a few pounds/dollas/euros, but the same problem will arise any time there is just too much to read, and not enough time.

Application Scenarios

As mentioned in the previous paragraph, every scenario where there is a lot of text to read can be a good application scenario for text summarisation. The following list is paraphrased from a tutorial given at ACL 2001 by Maybury and Many:

News summaries: what happened while I was away?
Physicians’ aids: summarise and compare the recommended treatments for this patient
Meeting summarisation: what happended at the meeting I missed?
Search engine result pages: snippets of the retrieved documents compared to the query
Intelligence: create 500-word file of a suspect
Small screens: create a screen-sized summary of a book/email
Aids for visually impaired: compact the text and read it out for a blind person

More examples:

Sentiment Analysis: give me pros and cons of a product
Social media: what are the trending topics today?

Text summarisation is not the only way to tackle the information overload in some of these scenarios, but it can play an important role and it can be used as a component of a more complex system that involves e.g. recommendations and search.

Properties of a Summary

Before we can build a summarisation system, we need to understand how the summary is going to be consumed.

There are many different ways to characterise a summary, here we summarise some of them.

Abstract vs. extract: do we rephrase the content or do we extract some of it? The first involves natural language generation, the latter involves e.g. phrase/sentence ranking.

Single vs Multi source: multiple sources can introduce discrepancies, confirming or contradicting some information. Reviews stating opposite opinions and experiences can be legitimate. News releases that contradict each others are problematic. How do we deal with duplicate content? How do we promote novelty?

Generic vs User-oriented: a generic summary is static, created once for all the users. A user-oriented summary is dynamic, tailored to a particular user profile or user session.

Summary Function: do we want to cover all the key points of the source, or just act as a preview? Do we provide an additional critical view on the source? Think about a movie, and compare its plot, its trailer and a review about it: they are all summaries of the movie, but they have different functions.

Summary Length: a summary should be… short. How short? Do we have a target length (number of words/sentences) or a compression rate (e.g. 5% of the source)?

Linguistic qualities: is the summary coherent? Is it fluent? Is it self-contained?

These are just some of the aspects to consider when building a summarisation application. The key question probably is: how is it going to help the user?

Evaluation: How Good is a Summary?

Evaluating a summary is a challenge in itself. The previous paragraph has opened the discussion for a variety of summarisation approaches, so in order to decide how good a summary is, we really need to put some more context.

In principle, there two orthogonal ways to evaluate summarisation: user-vs-system-based and intrinsic-vs-extrinsic. Let’s briefly discuss them.

User-based intrinsic: users are asked to judge the quality of the summary per se. A typical question could be as simple as “how did you like the summary?“, or something more complex regarding the coherence of the summary, or whether it was helpful to understand the full text.

User-based extrinsic: users are asked to complete a particular task. The quality of the summary is measured against how well the user performs on the task. Here, “how well” can involve, for example, accuracy or speed: does the summary improve the user’s performance?

System-based intrinsic: gold standard summaries are produced by human judges, and the system-generated summaries are compared against them. Some evaluation metrics are involved for the automatic generation of a score that allows summarisation systems to be compared. A common example is the ROUGE framework, based on n-gram overlaps.

System-based extrinsic: the system performs some other tasks (e.g. text classification), using the system-generated summary. The system performances are evaluated for these other tasks, with and without the use of the summarisation component.

In general, involving users is a longer and more expensive process, but can provide interesting insights in terms of summary quality. System-based summarisation with e.g. ROUGE can be useful for some initial comparison, when many potential system candidates are available and employing users to judge all of them could be simply not feasible.

Evaluating a summariser against a particular task (extrinsic evaluation) often helps to answer the initial question, how is the summary going to help the user?

Conclusions

To summarise :) there are a few aspects to consider before building a summarisation system.

This article has provided an overall introduction to the field, to highlight some of the key issues to think about.

Some follow-up article will provide more concrete examples with existing tools and actual implementation, to showcase the real use of text summarisation.

Building a search-as-you-type feature with Elasticsearch, AngularJS and Flask (Part 2: front-end)

This article is the second part of a tutorial which describes how to build a search-as-you-type feature based on Elasticsearch, Python/Flask and AngularJS.

The first part has discussed how to set-up Elasticsearch and a microservice in Python/Flask, i.e. the back-end part of the system. It also provided an overall view on the architecture. In this second part, we’ll discuss details about the front-end, based on AngularJS.

The full code is available at https://github.com/bonzanini/CheerMeApp-demo.

Single-Page App

The front-end is a single-page application which uses AngularJS, as well as Bootstrap for styling.

Firstly, we create an index.html page, declaring the HTML document as an AngularJS app with the ng-app attribute:

<html ng-app="myApp">

In the head declarations, we’ll need to include AngularJS itself as well as some of its components (we’re using angular-route and angular-resource), the Bootstrap stylesheet and the custom app code, e.g.

<head>
    <!-- Load AngularJS -->
    <script src="https://code.angularjs.org/1.4.3/angular.min.js"></script>
    <script src="https://code.angularjs.org/1.4.3/angular-route.min.js"></script>
    <script src="https://code.angularjs.org/1.4.3/angular-resource.min.js"></script>
    <!-- Load Bootstrap CSS-->
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
    <!-- Load custom app code -->
    <script type="text/javascript" src="app.js"></script>
</head>

The whole user interface resides in a <div> container as ng-view:

<body>
    <!-- The UI -->
    <div class="container">
        <div ng-view></div>
    </div>
</body>

Both ng-app and ng-view are directives defined by AngularJS, i.e. Angular extends HTML using attributes with a “ng-” prefix.

The core of the front-end

The main front-end component is the AngularJS app defined in app.js. The starting point is the module definition:

var myApp = angular.module("myApp", ["ngRoute", "ngResource", "myApp.services"]);

The myApp application has some dependencies, namely ngRoute for routing (e.g. for templates), ngResource to access external RESTful resources, and the custom myApp.services which defines the access such resources.

If you remember from the previous article, we have a microservice based on Python/Flask listening on localhost:5000, which provides access to a REST API. The myApp.services variable is what binds such API to our AngularJS app defining the way we access the resource, e.g.

// services definition
var services = angular.module("myApp.services", ["ngResource"]);

// create specific resources, defining the related URLs and how to access them
services
.factory('Beer', function($resource) {
    return $resource('http://localhost:5000/api/v1/beers/:id', {id: '@id'}, {
        get: { method: 'GET' },
        delete: { method: 'DELETE' }
    });
})
.factory('Beers', function($resource) {
    return $resource('http://localhost:5000/api/v1/beers', {}, {
        query: { method: 'GET', isArray: true },
        create: { method: 'POST', }
    });
})
.factory('Search', function($resource) {
    return $resource('http://localhost:5000/api/v1/search', {q: '@q'}, {
        query: { method: 'GET', isArray: true}
    });
});

Once the resources are defined, we can define the rules for routing/templating, e.g.

myApp.config(function($routeProvider) {
    $routeProvider
    .when('/', {
        templateUrl: 'pages/main.html',
        controller: 'mainController'
    })
    .when('/newBeer', {
        templateUrl: 'pages/beer_new.html',
        controller: 'newBeerController'
    })
    .when('/beers', {
        templateUrl: 'pages/beers.html',
        controller: 'beerListController'
    })
    .when('/beers/:id', {
        templateUrl: 'pages/beer_details.html',
        controller: 'beerDetailsController'
    })
});

The $routeProvider simply allows to associate a matching URL with a template (a HTML page) and a controller (a function that, among other aspects, binds data with the template).

For example the controller of the entry page can be defined as:

myApp.controller(
    'mainController',
    function ($scope, Search) {
        $scope.search = function() {
            q = $scope.searchString;
            if (q.length > 1) {
                $scope.results = Search.query({q: q});    
            }
        };
    }
);

In the controller, there are three references to the scope, namely searchString, results and search. The first one is the content of the input field used for search, i.e.

<input type="text" class="form-control" ng-model="searchString" placeholder='Search: e.g. "light beer" or "London"' ng-change="search()"/>

while the second one is the list of results, in form of table rows, i.e.

<tr ng-repeat="result in results">
    <td><a href="#/beers/{{result.id}}">{{ result.name }}</a></td>
    <td>{{ result.producer }}</td>
</tr>

The third reference is a function, search(), defined in the controller itself, and invoked by the UI whenever the text in the input field is changed. The function checks if the text has at least two characters, and then sends it as a query to the Search resource declared at the beginning in the services var (i.e. as part of the REST API). If the search provides results (a list of beers along with their producers), these are shown are table rows.

The two HTML definitions above are part of the pages/main.html template described above and linked to the mainController().

Other controllers are defined in a similar fashion, and they all define the behavior of a specific view, just with a few lines of Javascript.

Summary

Using AngularJS and Bootstrap, we have quickly created a simple and clean UI for our search-as-you-type system. As the access to the data happens through the microservice defined in the previous article, we have defined the access to the REST API as ngResource.

Each view in the UI is defined as a template, i.e. a HTML page. The behaviour of the UI and the data binding is defined in the controllers.

All in all, with a relatively small amount of Javascript code, AngularJS allows to build an interactive UI which can access REST resources.

Links:

@MarcoBonzanini

Building a Search-As-You-Type Feature with Elasticsearch, AngularJS and Flask

Search-as-you-type is an interesting feature of modern search engines, that allows users to have an instant feedback related to their search, while they are still typing a query.

In this tutorial, we discuss how to implement this feature in a custom search engine built with Elasticsearch and Python/Flask on the backend side, and AngularJS for the frontend.

The full code is available at https://github.com/bonzanini/CheerMeApp-demo. If you go through the code, have a look at the readme file first, in particular to understand the limitations of the code.

This first part describes the details of the backend, i.e. Elasticsearch and Python/Flask.

Update: the second part of this tutorial has been published and it discusses the front-end in AngularJS.

Overall Architecture

As this demo was prototyped during International Beer Day 2015, we’ll build a small database of beers, each of which will be defined by a name, the name of its producer, a list of beer styles and a textual description. The idea is to make all these data available for search in one coherent interface, so you can just type in the name of your favourite brew, or aspects like “light and fruity”.

Our system is made up of three components:

Elasticsearch: used as data storage and for its search capabilities.
Python/Flask Microservice: the backend component that has access to Elasticsearch and provides a RESTful API for the frontend.
AngularJS UI: the frontend that requests data to the backend microservice.

There are two types of documents – beers and styles. While styles are simple strings with the style name, beers are more complex. This is an example:

{
    "name": "Raspberry Wheat Beer", 
    "styles": ["Wheat Ale", "Fruit Beer"], 
    "abv": 5.0, 
    "producer": "Meantime Brewing London", 
    "description": "Based on a pale, lightly hopped wheat beer, the refreshingly crisp fruitiness, aroma and rich colour come from the addition of fresh raspberry puree during maturation."
}

(the description is taken from the producer’s website in August 2015).

Setting Up Elasticsearch

The mapping for the Elasticsearch types is fairly straightforward. The key detail in order to enable the search-as-you-type feature is how to perform partial matching over strings.

One option is to use wildcard queries over not_analyzed fields, similar to a ... WHERE field LIKE '%foobar%' query in SQL, but this is usually too expensive. Another option is to change the analysis chain in order to index also partial strings: this will result in a bigger index but in faster queries.

We can achieve our goal by using the edge_ngram filter as part of a custom analyser, e.g.:

{
    "settings": {
        "number_of_shards" : 1,
        "number_of_replicas": 0,
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type":     "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}

In this example, the custom filter will allow to index substrings of 2-to-15 characters. You can customise these boundaries, but indexing unigram (min_gram: 1) probably will cause any query to match any document, and words longer than 15 chars are rarely observed (e.g. we’re not dealing with long compounds).

Once the custom analysis chain is defined, the mapping is easy:

{    
    "mappings": {
        "beers": {
            "properties": {
                "name": {"type": "string", "index_analyzer": "autocomplete", "search_analyzer": "standard"},
                "styles": {"type": "string", "index_analyzer": "autocomplete", "search_analyzer": "standard"},
                "abv": {"type": "float"},
                "producer": {"type": "string", "index_analyzer": "autocomplete", "search_analyzer": "standard"},
                "description": {"type": "string", "index_analyzer": "autocomplete", "search_analyzer": "standard"}
            }
        },
        "styles": {
            "properties": {
                "name": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
}

Populating Elasticsearch

Assuming you have Elasticsearch up-and-running locally on localhost:9200 (the default), you can simply type make index from the demo folder.

This will firstly try to delete an index called cheermeapp (you’ll see a missing index error the first time, as there is of course no index yet). Secondly, the index is recreated by pushing the mapping file to Elasticsearch, and finally some data are indexed using the _bulk API.

If you want to see some data, you can now type:

curl -XPOST http://localhost:9200/cheermeapp/beers/_search?pretty -d '{"query": {"match_all": {}}}'

A Python Microservice with Flask

As the Elasticsearch service is by default open to any connection, it is common practice to put it behind a custom web-service. Luckily, Flask and its Flask-RESTful extension allow use to quickly set up a RESTful microservice which exposes some useful endpoints. These endpoints will then be queries by the frontend.

If you’re following the code from the repo, the recommendation is to set-up a local virtualenv as described in the readme, in order to install the dependencies locally. You can see the full code for the backend microservice is the backend folder.

In particular, in backend/__init__.py we declare the Flask application as:

from flask import Flask
from flask_restful import reqparse, Resource, Api
from flask.ext.cors import CORS
from . import config
import requests
import json

app = Flask(__name__)
CORS(app) # required for Cross-origin Request Sharing
api = Api(app)

By setting up the backend app as Python package (a folder with an __init__.py file), the script to run this app is extremely simple:

# runbackend.py
from backend import app

if __name__ == '__main__':
    app.run(debug=True)

This code just sets up an empty web-service: we need to implement the endpoints and the related resources. One nice aspect of Flask-RESTful is that it allows to define the resources as Python classes, adding the endpoints with minimal effort.

For example, in backend/__init__.py we can continue defining the following:

class Beer(Resource):

    def get(self, beer_id):
        # the base URL for a "beers" object in Elasticsearch, e.g.
        # http://localhost:9200/cheermeapp/beers/<beer_id>
        url = config.es_base_url['beers']+'/'+beer_id
        # query Elasticsearch
        resp = requests.get(url)
        data = resp.json()
        # Return the full Elasticsearch object as a result
        beer = data['_source']
        return beer

    def delete(self, beer_id):
        # same as above
        url = config.es_base_url['beers']+'/'+beer_id
        # Query Elasticsearch
        resp = requests.delete(url)
        # return the response
        data = resp.json()
        return data
# The API URLs all start with /api/v1, in case we need to implement different versions later
api.add_resource(Beer, config.api_base_url+'/beers/<beer_id>')

class BeerList(Resource):

    def get(self):
        # same as above
        url = config.es_base_url['beers']+'/_search'
        # we retrieve all the beers (well, at least the first 100)
        # Limitation: pagination to be implemented
        query = {
            "query": {
                "match_all": {}
            },
            "size": 100
        }
        # query Elasticsearch
        resp = requests.post(url, data=json.dumps(query))
        data = resp.json()
        # build an array of results and return it
        beers = []
        for hit in data['hits']['hits']:
            beer = hit['_source']
            beer['id'] = hit['_id']
            beers.append(beer)
        return beers
api.add_resource(BeerList, config.api_base_url+'/beers')

The above code implements the GET and DELETE methods for /api/v1/beers/, which respectively retrieve and delete a specific beer, and the GET method for the /api/v1/beers, which retrieve the full list of beers. In the repo, you can also observe the POST method implemented on the BeerList class, which allows to create a new beer.

Design note: given that create-read-update operations, as well as the search, will work on the same data model, it’s probably more sensible to de-couple the object model from the endpoint definition, e.g. by defining a BeerModel and call it from the related resources.

From the repo, you can also see the implementation of the /api/v1/styles endpoint.

One the backend is running, the service will be accessible at localhost:5000 (the default option for Flask). You can test it with:

curl -XGET http://localhost:5000/api/v1/beers

The Search Functionality

Besides serving “items”, our microservice also incorporates a search functionality:

class Search(Resource):

    def get(self):
        # parse the query: ?q=[something]
        parser.add_argument('q')
        query_string = parser.parse_args()
        # base search URL
        url = config.es_base_url['beers']+'/_search'
        # Query Elasticsearch
        query = {
            "query": {
                "multi_match": {
                    "fields": ["name", "producer", "description", "styles"],
                    "query": query_string['q'],
                    "type": "cross_fields",
                    "use_dis_max": False
                }
            },
            "size": 100
        }
        resp = requests.post(url, data=json.dumps(query))
        data = resp.json()
        # Build an array of results
        beers = []
        for hit in data['hits']['hits']:
            beer = hit['_source']
            beer['id'] = hit['_id']
            beers.append(beer)
        return beers
api.add_resource(Search, config.api_base_url+'/search')

The above code will make a /api/v1/search endpoint available for custom queries.

The interface with Elasticsearch is a custom multi_match and cross_fields query, which searches over the name, producer, styles and description fields, i.e. all the textual fields.

By default, Elasticsearch performs multi_match queries as best_fields, which means only the field with the best score will give the overall score for a particular document. In our case, we prefer to have all the fields to contribute to the final score. In particular, we want to avoid longer fields like the description to be penalised by the document length normalisation.

Design note: notice how we’re duplicating the same code at the end of Search.get() and BeerList.get(), we should really decouple this.

You can test the search service with:

curl -XGET http://localhost:5000/api/v1/search?q=lon
# will retrieve all the beers matching "lon", e.g. containing the string "london"

The next step is to create the frontend to query the microservice and show the results in a nice UI. The implementation is already available in the repo, and will be discussed in the next article.

Summary

This article sets up the backend side of a search-as-you-type application.

The scenario is the CheerMeApp application, a mini database of beers with names, styles and descriptions. The search application can match any of these fields while the user is still typing, i.e. with partial string matching.

The backend side of the app is based on Elasticsearch for the data storage and search functionality. In particular, by indexing the substrings (n-grams) we allow for partial string matching, by increasing the size of the index on disk without hurting query-time performances.

The data storage is “hidden” behind a Python/Flask microservice, which provides endpoint for a client to query. In particular, we have seen how the Flask-RESTful extension allows to quickly create RESTful applications by simply declaring the resources as Python classes.

The next article will discuss some aspects of the frontend, developed in AngularJS, and how to link it with the backend.