Building a search-as-you-type feature with Elasticsearch, AngularJS and Flask (Part 2: front-end)

This article is the second part of a tutorial which describes how to build a search-as-you-type feature based on Elasticsearch, Python/Flask and AngularJS.

The first part discussed how to set up Elasticsearch and a microservice in Python/Flask, i.e. the back-end part of the system. It also provided an overall view of the architecture. In this second part, we’ll discuss details about the front-end, based on AngularJS.

The full code is available at https://github.com/bonzanini/CheerMeApp-demo.

Single-Page App

The front-end is a single-page application which uses AngularJS, as well as Bootstrap for styling.

Firstly, we create an index.html page, declaring the HTML document as an AngularJS app with the ng-app attribute:

<html ng-app="myApp">

In the head declarations, we’ll need to include AngularJS itself as well as some of its components (we’re using angular-route and angular-resource), the Bootstrap stylesheet and the custom app code, e.g.

<head>
    <!-- Load AngularJS -->
    <script src="https://code.angularjs.org/1.4.3/angular.min.js"></script>
    <script src="https://code.angularjs.org/1.4.3/angular-route.min.js"></script>
    <script src="https://code.angularjs.org/1.4.3/angular-resource.min.js"></script>
    <!-- Load Bootstrap CSS-->
    <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css">
    <!-- Load custom app code -->
    <script type="text/javascript" src="app.js"></script>
</head>

The whole user interface resides in a <div> container marked with the ng-view directive:

<body>
    <!-- The UI -->
    <div class="container">
        <div ng-view></div>
    </div>
</body>

Both ng-app and ng-view are directives defined by AngularJS, i.e. Angular extends HTML using attributes with an “ng-” prefix.

The core of the front-end

The main front-end component is the AngularJS app defined in app.js. The starting point is the module definition:

var myApp = angular.module("myApp", ["ngRoute", "ngResource", "myApp.services"]);

The myApp application has some dependencies, namely ngRoute for routing (e.g. for templates), ngResource to access external RESTful resources, and the custom myApp.services module, which defines how to access such resources.

If you remember from the previous article, we have a microservice based on Python/Flask listening on localhost:5000, which provides access to a REST API. The myApp.services module is what binds that API to our AngularJS app, defining the way we access each resource, e.g.

// services definition
var services = angular.module("myApp.services", ["ngResource"]);

// create specific resources, defining the related URLs and how to access them
services
.factory('Beer', function($resource) {
    return $resource('http://localhost:5000/api/v1/beers/:id', {id: '@id'}, {
        get: { method: 'GET' },
        delete: { method: 'DELETE' }
    });
})
.factory('Beers', function($resource) {
    return $resource('http://localhost:5000/api/v1/beers', {}, {
        query: { method: 'GET', isArray: true },
        create: { method: 'POST' }
    });
})
.factory('Search', function($resource) {
    return $resource('http://localhost:5000/api/v1/search', {q: '@q'}, {
        query: { method: 'GET', isArray: true}
    });
});

Once the resources are defined, we can define the rules for routing/templating, e.g.

myApp.config(function($routeProvider) {
    $routeProvider
    .when('/', {
        templateUrl: 'pages/main.html',
        controller: 'mainController'
    })
    .when('/newBeer', {
        templateUrl: 'pages/beer_new.html',
        controller: 'newBeerController'
    })
    .when('/beers', {
        templateUrl: 'pages/beers.html',
        controller: 'beerListController'
    })
    .when('/beers/:id', {
        templateUrl: 'pages/beer_details.html',
        controller: 'beerDetailsController'
    })
});

The $routeProvider simply lets us associate a matching URL with a template (an HTML page) and a controller (a function that, among other aspects, binds data to the template).

For example the controller of the entry page can be defined as:

myApp.controller(
    'mainController',
    function ($scope, Search) {
        $scope.search = function() {
            var q = $scope.searchString;
            if (q.length > 1) {
                $scope.results = Search.query({q: q});    
            }
        };
    }
);

In the controller, there are three references to the scope, namely searchString, results and search. The first one is the content of the input field used for search, i.e.

<input type="text" class="form-control" ng-model="searchString" placeholder='Search: e.g. "light beer" or "London"' ng-change="search()"/>

while the second one is the list of results, in the form of table rows, i.e.

<tr ng-repeat="result in results">
    <td><a href="#/beers/{{result.id}}">{{ result.name }}</a></td>
    <td>{{ result.producer }}</td>
</tr>

The third reference is a function, search(), defined in the controller itself and invoked by the UI whenever the text in the input field changes. The function checks whether the text has at least two characters, and then sends it as a query to the Search resource declared earlier among the services (i.e. as part of the REST API). If the search yields results (a list of beers along with their producers), these are shown as table rows.

The two HTML snippets above are part of the pages/main.html template and are linked to mainController().

Other controllers are defined in a similar fashion, and they all define the behavior of a specific view, with just a few lines of JavaScript.

Summary

Using AngularJS and Bootstrap, we have quickly created a simple and clean UI for our search-as-you-type system. As data access happens through the microservice defined in the previous article, we have declared the REST API endpoints as ngResource resources.

Each view in the UI is defined as a template, i.e. an HTML page. The behaviour of the UI and the data binding are defined in the controllers.

All in all, with a relatively small amount of JavaScript code, AngularJS allows us to build an interactive UI which can access REST resources.


Building a Search-As-You-Type Feature with Elasticsearch, AngularJS and Flask

Search-as-you-type is an interesting feature of modern search engines: it gives users instant feedback on their search while they are still typing the query.

In this tutorial, we discuss how to implement this feature in a custom search engine built with Elasticsearch and Python/Flask on the backend side, and AngularJS for the frontend.

The full code is available at https://github.com/bonzanini/CheerMeApp-demo. If you go through the code, have a look at the readme file first, in particular to understand the limitations of the code.

This first part describes the details of the backend, i.e. Elasticsearch and Python/Flask.

Update: the second part of this tutorial has been published and it discusses the front-end in AngularJS.

Overall Architecture

As this demo was prototyped during International Beer Day 2015, we’ll build a small database of beers, each of which will be defined by a name, the name of its producer, a list of beer styles and a textual description. The idea is to make all these data available for search in one coherent interface, so you can just type in the name of your favourite brew, or aspects like “light and fruity”.

Our system is made up of three components:

  • Elasticsearch: used as data storage and for its search capabilities.
  • Python/Flask Microservice: the backend component that has access to Elasticsearch and provides a RESTful API for the frontend.
  • AngularJS UI: the frontend that requests data from the backend microservice.

There are two types of documents – beers and styles. While styles are simple strings with the style name, beers are more complex. This is an example:

{
    "name": "Raspberry Wheat Beer", 
    "styles": ["Wheat Ale", "Fruit Beer"], 
    "abv": 5.0, 
    "producer": "Meantime Brewing London", 
    "description": "Based on a pale, lightly hopped wheat beer, the refreshingly crisp fruitiness, aroma and rich colour come from the addition of fresh raspberry puree during maturation."
}

(the description is taken from the producer’s website in August 2015).

Setting Up Elasticsearch

The mapping for the Elasticsearch types is fairly straightforward. The key detail in order to enable the search-as-you-type feature is how to perform partial matching over strings.

One option is to use wildcard queries over not_analyzed fields, similar to a ... WHERE field LIKE '%foobar%' query in SQL, but this is usually too expensive. Another option is to change the analysis chain in order to also index partial strings: this results in a bigger index, but in faster queries.

We can achieve our goal by using the edge_ngram filter as part of a custom analyser, e.g.:

{
    "settings": {
        "number_of_shards" : 1,
        "number_of_replicas": 0,
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type":     "edge_ngram",
                    "min_gram": 2,
                    "max_gram": 15
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type":      "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "autocomplete_filter" 
                    ]
                }
            }
        }
    }
}

In this example, the custom filter allows us to index substrings of 2 to 15 characters. You can customise these boundaries, but indexing unigrams (min_gram: 1) would probably cause almost any query to match almost any document, and words longer than 15 characters are rarely observed (e.g. we’re not dealing with long compounds).
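
To double-check that the analysis chain behaves as expected, once an index with these settings exists (see the next section) we can inspect the tokens produced by the autocomplete analyzer via the _analyze API. A minimal sketch in Python, assuming the cheermeapp index of this demo and the Elasticsearch 1.x request format (recent versions expect the analyzer and text in the request body):

import requests

# run the custom "autocomplete" analyzer over a sample string
resp = requests.get('http://localhost:9200/cheermeapp/_analyze',
                    params={'analyzer': 'autocomplete', 'text': 'London Porter'})
for token in resp.json().get('tokens', []):
    print(token['token'])
# expected tokens include partial strings such as "lo", "lon", "lond", ... "po", "por", ...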

Once the custom analysis chain is defined, the mapping is easy:

{    
    "mappings": {
        "beers": {
            "properties": {
                "name": {"type": "string", "index_analyzer": "autocomplete", "search_analyzer": "standard"},
                "styles": {"type": "string", "index_analyzer": "autocomplete", "search_analyzer": "standard"},
                "abv": {"type": "float"},
                "producer": {"type": "string", "index_analyzer": "autocomplete", "search_analyzer": "standard"},
                "description": {"type": "string", "index_analyzer": "autocomplete", "search_analyzer": "standard"}
            }
        },
        "styles": {
            "properties": {
                "name": {"type": "string", "index": "not_analyzed"}
            }
        }
    }
}

Populating Elasticsearch

Assuming you have Elasticsearch up-and-running locally on localhost:9200 (the default), you can simply type make index from the demo folder.

This will firstly try to delete an index called cheermeapp (you’ll see a missing index error the first time, as there is of course no index yet). Secondly, the index is recreated by pushing the mapping file to Elasticsearch, and finally some data are indexed using the _bulk API.
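
For reference, the bulk step boils down to posting an alternation of action lines and document lines to the _bulk API. A minimal sketch in Python, with two heavily simplified documents just for illustration (the real data and the exact commands live in the repo’s Makefile):

import json
import requests

beers = [
    {"name": "Raspberry Wheat Beer", "producer": "Meantime Brewing London"},
    {"name": "Some Other Beer", "producer": "Some Other Producer"},
]

# the bulk payload alternates an action line with the document itself,
# and must be terminated by a newline
lines = []
for i, beer in enumerate(beers, start=1):
    lines.append(json.dumps({"index": {"_index": "cheermeapp", "_type": "beers", "_id": i}}))
    lines.append(json.dumps(beer))
payload = '\n'.join(lines) + '\n'

resp = requests.post('http://localhost:9200/_bulk', data=payload)
print(resp.json())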

If you want to see some data, you can now type:

curl -XPOST http://localhost:9200/cheermeapp/beers/_search?pretty -d '{"query": {"match_all": {}}}'

A Python Microservice with Flask

As the Elasticsearch service is by default open to any connection, it is common practice to put it behind a custom web service. Luckily, Flask and its Flask-RESTful extension allow us to quickly set up a RESTful microservice which exposes some useful endpoints. These endpoints will then be queried by the frontend.

If you’re following the code from the repo, the recommendation is to set up a local virtualenv as described in the readme, in order to install the dependencies locally. You can see the full code for the backend microservice in the backend folder.

In particular, in backend/__init__.py we declare the Flask application as:

from flask import Flask
from flask_restful import reqparse, Resource, Api
from flask.ext.cors import CORS
from . import config
import requests
import json

app = Flask(__name__)
CORS(app) # required for Cross-Origin Resource Sharing
api = Api(app)

By setting up the backend app as a Python package (a folder with an __init__.py file), the script to run this app is extremely simple:

# runbackend.py
from backend import app

if __name__ == '__main__':
    app.run(debug=True)

This code just sets up an empty web service: we need to implement the endpoints and the related resources. One nice aspect of Flask-RESTful is that it allows us to define the resources as Python classes, adding the endpoints with minimal effort.

For example, in backend/__init__.py we can continue defining the following:

class Beer(Resource):

    def get(self, beer_id):
        # the base URL for a "beers" object in Elasticsearch, e.g.
        # http://localhost:9200/cheermeapp/beers/<beer_id>
        url = config.es_base_url['beers']+'/'+beer_id
        # query Elasticsearch
        resp = requests.get(url)
        data = resp.json()
        # Return the full Elasticsearch object as a result
        beer = data['_source']
        return beer

    def delete(self, beer_id):
        # same as above
        url = config.es_base_url['beers']+'/'+beer_id
        # Query Elasticsearch
        resp = requests.delete(url)
        # return the response
        data = resp.json()
        return data
# The API URLs all start with /api/v1, in case we need to implement different versions later
api.add_resource(Beer, config.api_base_url+'/beers/<beer_id>')

class BeerList(Resource):

    def get(self):
        # same as above
        url = config.es_base_url['beers']+'/_search'
        # we retrieve all the beers (well, at least the first 100)
        # Limitation: pagination to be implemented
        query = {
            "query": {
                "match_all": {}
            },
            "size": 100
        }
        # query Elasticsearch
        resp = requests.post(url, data=json.dumps(query))
        data = resp.json()
        # build an array of results and return it
        beers = []
        for hit in data['hits']['hits']:
            beer = hit['_source']
            beer['id'] = hit['_id']
            beers.append(beer)
        return beers
api.add_resource(BeerList, config.api_base_url+'/beers')

The above code implements the GET and DELETE methods for /api/v1/beers/<beer_id>, which respectively retrieve and delete a specific beer, and the GET method for /api/v1/beers, which retrieves the full list of beers. In the repo, you can also observe the POST method implemented on the BeerList class, which allows the creation of a new beer.

Design note: given that create-read-update operations, as well as the search, will work on the same data model, it’s probably more sensible to de-couple the object model from the endpoint definition, e.g. by defining a BeerModel and calling it from the related resources.
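
A minimal sketch of what such a de-coupling could look like; the BeerModel class below is hypothetical and not part of the demo repo:

import json
import requests


class BeerModel(object):
    """Hypothetical model wrapping the Elasticsearch access for beers."""

    def __init__(self, es_base_url):
        # e.g. http://localhost:9200/cheermeapp/beers
        self.es_base_url = es_base_url

    def get(self, beer_id):
        resp = requests.get(self.es_base_url + '/' + beer_id)
        return resp.json()['_source']

    def delete(self, beer_id):
        resp = requests.delete(self.es_base_url + '/' + beer_id)
        return resp.json()

The Beer resource above would then shrink to a thin HTTP layer calling the model, and the same model could be shared with BeerList and Search.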

From the repo, you can also see the implementation of the /api/v1/styles endpoint.

Once the backend is running, the service will be accessible at localhost:5000 (the default option for Flask). You can test it with:

curl -XGET http://localhost:5000/api/v1/beers

The Search Functionality

Besides serving “items”, our microservice also incorporates a search functionality:

class Search(Resource):

    def get(self):
        # parse the query: ?q=[something]
        parser = reqparse.RequestParser()
        parser.add_argument('q')
        query_string = parser.parse_args()
        # base search URL
        url = config.es_base_url['beers']+'/_search'
        # Query Elasticsearch
        query = {
            "query": {
                "multi_match": {
                    "fields": ["name", "producer", "description", "styles"],
                    "query": query_string['q'],
                    "type": "cross_fields",
                    "use_dis_max": False
                }
            },
            "size": 100
        }
        resp = requests.post(url, data=json.dumps(query))
        data = resp.json()
        # Build an array of results
        beers = []
        for hit in data['hits']['hits']:
            beer = hit['_source']
            beer['id'] = hit['_id']
            beers.append(beer)
        return beers
api.add_resource(Search, config.api_base_url+'/search')

The above code will make a /api/v1/search endpoint available for custom queries.

The interface with Elasticsearch is a custom multi_match and cross_fields query, which searches over the name, producer, styles and description fields, i.e. all the textual fields.

By default, Elasticsearch performs multi_match queries as best_fields, which means only the field with the best score determines the overall score for a particular document. In our case, we prefer all the fields to contribute to the final score. In particular, we want to avoid longer fields like the description being penalised by the document length normalisation.

Design note: notice how we’re duplicating the same code at the end of Search.get() and BeerList.get(); we should really decouple this.
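
One simple way to remove that duplication would be a small shared helper, e.g. (a hypothetical function, not in the repo):

def hits_to_beers(data):
    """Turn an Elasticsearch search response into a list of beers with their ids."""
    beers = []
    for hit in data['hits']['hits']:
        beer = hit['_source']
        beer['id'] = hit['_id']
        beers.append(beer)
    return beers

Both BeerList.get() and Search.get() could then simply end with return hits_to_beers(data).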

You can test the search service with:

curl -XGET http://localhost:5000/api/v1/search?q=lon
# will retrieve all the beers matching "lon", e.g. containing the string "london"

The next step is to create the frontend to query the microservice and show the results in a nice UI. The implementation is already available in the repo, and will be discussed in the next article.

Summary

This article sets up the backend side of a search-as-you-type application.

The scenario is the CheerMeApp application, a mini database of beers with names, styles and descriptions. The search application can match any of these fields while the user is still typing, i.e. with partial string matching.

The backend side of the app is based on Elasticsearch for the data storage and search functionality. In particular, by indexing substrings (n-grams) we allow for partial string matching, at the cost of a bigger index on disk but without hurting query-time performance.

The data storage is “hidden” behind a Python/Flask microservice, which provides endpoints for a client to query. In particular, we have seen how the Flask-RESTful extension allows us to quickly create RESTful applications simply by declaring the resources as Python classes.

The next article will discuss some aspects of the frontend, developed in AngularJS, and how to link it with the backend.

Tuning Relevance in Elasticsearch with Custom Boosting

Elasticsearch offers different options out of the box in terms of ranking function (similarity function, in Lucene terminology). The default ranking function is a variation of TF-IDF, relatively simple to understand and, thanks to some smart normalisations, also quite effective in practice.

Each use case is a different story, so sometimes the default ranking function doesn’t work as well as it does for the general case. In particular, when the collection starts being fairly diverse in terms of document structure (different document types, with different fields, of different sizes, etc.), we need to adjust the default behaviour so that we can tune relevance.

Static Boosting

The first option we discuss is static boosting, or in other words boosting at indexing time. The boost factor is calculated depending on the nature of the document.

For example, if we have different types of documents in the index, some types could be given more importance because of a specific business rule, or simply to compensate for differences in structure: since short documents are promoted by the default ranking, long documents end up being penalised.

A different example is the need to apply some sort of popularity/authority score, such as PageRank or similar, that takes into account how a document is linked to other documents.

Once we have a clear formula for the boost factor, we can store its value as an extra field in the document, and use a function_score query to incorporate it into the relevance score, e.g.:

{
    "query": {
        "function_score": {
            "query": {  
                "match": {
                    "some_field_name": "query terms here"
                }
            },
            "functions": [{
                "field_value_factor": { 
                    "field": "my_boost_field"
                }
            }],
            "score_mode": "multiply"
        }
    }
}

In this example, the boost factor is stored in my_boost_field: its value and the relevance score coming from the similarity function are multiplied to achieve the final score used for the ranking.

Because of the nature of this approach, if we want to give a different boosting to a document, the document must be re-indexed for the change to take effect. In other words, we shouldn’t use this approach if the business rules that define the boosting are likely to change, or in general if we know that over time the relative importance of a document will change (and its boost factor with it).
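
For illustration, changing a static boost means touching the stored document itself, for example with a partial update (the index, type and id names below are hypothetical; note that a partial update still re-indexes the document internally):

import json
import requests

update = {"doc": {"my_boost_field": 2.5}}
resp = requests.post('http://localhost:9200/myindex/mytype/some_doc_id/_update',
                     data=json.dumps(update))
print(resp.json())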

Dynamic boosting

A different approach is to boost the documents at query time.

In a previous post – How to Promote Recent Articles in Elasticsearch – we discussed an example of dynamic boosting used to promote recently published documents using a so-called decay function. In this case, we used the value of a date field to calculate the relative importance of the documents.

When our query is a combination of different sub-queries, we can assign different weights to the different components, for example:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "some query terms here",
              "boost": 3
            }
          }
        },
        {
          "match": { 
            "content": "other query terms there"
          }
        }
      ]
    }
  }
}

This is a boolean OR query with two clauses, one over the field title and one over the field content. The clauses can be any type of query, and the boosting factor set to 3 means that the first clause is three times more important than the second clause.

It is worth noticing that, because of the length normalisation, short fields are implicitly boosted by the default similarity, and the title is usually shorter than the whole content (similar considerations apply to the subject vs the body of a message, etc.).

When the query is a multi_match one, we can plug the boost factor in-line, for example:

{
  "multi_match" : {
    "query": "some query terms here", 
    "fields": [ "title^3", "content" ] 
  }
}

In this case, the same match query is executed over the two different fields, and using the caret syntax we can specify, as in the previous example, that the title is three times more important than the content.

Sensible values for this kind of boost factor are usually in the range 1-10, or 1-15 at most, as the Elasticsearch manual suggests. This is because some boost normalisation happens internally, so bigger values will not have much impact.

Final Thoughts

Improving relevance is a crucial part of a search application, and often the final tuning is a matter of try-and-see. Boosting – in particular dynamic boosting – allows some level of control over relevance tuning, in a way that is easy to experiment with, and easy to change in order to adapt for specific business rules. With dynamic boosting, there is no need to re-index, so this is an interesting approach for relevance tuning.

Elasticsearch offers a boosting option for nearly every type of query, and on top of this, a function_score can also be employed to further personalise the scoring function.


How to Promote Recent Articles in Elasticsearch

The default ranking options in Elasticsearch (and Lucene) are purely based on content. Sometimes, it is useful to mix other factors in, such as the location/distance of a restaurant, to promote venues close to the user’s location, or the publication date of an article, to promote recent articles.

This post discusses a simple solution which is already integrated in Elasticsearch and can be enabled at query time: a decay function.

Default Ranking in Elasticsearch

The default ranking function (called similarity in Lucene terminology) is a normalised version of TF-IDF. The normalisation is based on document length and promotes shorter documents over longer ones. Informally, the idea behind this approach is that longer documents will have higher TF’s just because they have more content, so something should be done to avoid penalising short documents. In practice, this works well most of the time, but in case this normalised TF-IDF is not what you need, there are plenty of other options, including BM25, Divergence From Randomness or Language Model. All these ranking functions are based on the content of a field, i.e. they are based on term statistics.
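
As a side note, switching a field to one of the alternative similarities (e.g. BM25) is a per-field mapping option. A minimal sketch, with hypothetical index, type and field names, sent from Python in the same style used in the other posts:

import json
import requests

mapping = {
    "mappings": {
        "articles": {
            "properties": {
                # use the built-in BM25 similarity instead of the default for this field
                "title": {"type": "string", "similarity": "BM25"}
            }
        }
    }
}
resp = requests.put('http://localhost:9200/blog_bm25', data=json.dumps(mapping))
print(resp.json())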

To showcase how the default ranking works, and how we can affect it later, I’ll use a sample database of articles I’ve published on this blog (click here for the Gist). The Gist will create an index on a local instance of Elasticsearch, where each document will have the fields title (string) and published (date).

Once the index is created, we can run a very basic query:

{
    "query": {
        "match": {
            "title": "python"
        }
    }
}

The query will look for the term python in the title field, and you should see how the documents ranking higher are the ones with shorter titles. This is because the term python occurs only once in each title, so what makes the difference in terms of scoring is the document length normalisation.

Function Score and Decay Functions

Elasticsearch provides extensive support for custom scoring via the query DSL, meaning that relevance can be tweaked at query time without re-indexing. Using a function_score query, we can smooth the default scoring function to include freshness (a.k.a. recency) as a component for relevance.

Specifically, the use of some scoring functions called decay functions (namely gaussian, exponential and linear) is the key to integrating numeric fields, date fields and geo fields into the picture.

As an example, let’s consider the following query:

{
    "query": {
        "function_score": {
            "query": {
                "match": { "title": "python"}
            },
            "gauss": {
                "published": {
                    "scale":  "8w"
                }
            }
        }
    }
}

This query produces the same result set as the previous one, but the ranking will be very different. In particular, recent articles will be pushed on top. The scale parameter is what governs how quickly the score drops when moving away from the origin. In this example, we have used a range of 8 weeks, meaning that articles older than that will have a very low score.
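
For the record, the gauss function also accepts origin, offset and decay parameters, which make the behaviour of scale more explicit. A sketch of the same query run from Python (the blog/articles index and type names are placeholders for however you created the Gist index):

import json
import requests

query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "python"}},
            "gauss": {
                "published": {
                    # origin defaults to "now" for date fields
                    "scale": "8w",   # distance at which the score has dropped to "decay"
                    "offset": "1w",  # no penalty within one week of the origin
                    "decay": 0.5     # the default decay value
                }
            }
        }
    }
}
resp = requests.post('http://localhost:9200/blog/articles/_search?pretty',
                     data=json.dumps(query))
print(resp.text)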

Summary

Content is the key aspect to consider in most ranking problems which involve textual data. Sometimes though, content is not the only aspect to consider. This article has showcased decay functions as a simple Elasticsearch option to influence ranking by including freshness/recency as a core component to be mixed with content. Since decay functions work on numeric fields, date fields and geo-points, they can be used to integrate location/distance, recency and other numerical aspects like popularity (e.g. number of votes) into the ranking function. This can be done at query time, without the need to re-index.

Phrase Match and Proximity Search in Elasticsearch

The case of multi-term queries in Elasticsearch offers some room for discussion, because there are several options to consider depending on the specific use case we’re dealing with.

Multi-term queries are, in their most generic definition, queries with several terms. These terms could be completely unrelated, or they could be about the same topic, or they could even be part of a single specific concept. All these scenarios call for different configurations. This article discusses some of the options with Elasticsearch.

Sample Documents

Assuming Elasticsearch is up and running on your local machine, you can download the script which creates the data set used in the following examples. These are the created documents:

Doc ID  Content
1       This is a brown fox
2       This is a brown dog
3       This dog is really brown
4       The dog is brown but this document is very very long
5       There is also a white cat
6       The quick brown fox jumps over the lazy dog

Notice that for the sake of these examples, we’re using the default configuration, which means using the default TF-IDF scoring function from Lucene, which includes some score normalisation based on the document length (shorter documents are promoted).

In order to run all the following queries, the basic option is to use curl, e.g.:

curl -XPOST http://localhost:9200/test/articles/_search?pretty=true -d '{THE QUERY CODE HERE}'

although one could embed the query in Python code as discussed in a previous post (or in any programming language that can perform REST calls). With the pretty=true parameter, the JSON output will be more readable on the shell.

A General Purpose Query

The first example of query is for the simple case of terms which may or may not be related. In this scenario, we decide to use a classic match query. This kind of query does not impose any restriction between multiple terms, but of course will promote documents which contain more query terms.

{
    "query": {
        "match": {
            "content": {
                "query": "quick brown dog"
            }
        }
     }
}

This query retrieves 5 documents, in this order:

Pos  Doc ID  Content                                                Score
1    6       The quick brown fox jumps over the lazy dog            0.81502354
2    2       This is a brown dog                                    0.26816052
3    3       This dog is really brown                               0.26816052
4    4       The dog is brown but this document is very very long   0.15323459
5    1       This is a brown fox                                    0.055916067

We can notice how the first document has a much higher score, because it’s the only one containing all the query terms. The documents in positions 2 and 3 share the same score, because they have the same number of matches (two terms) and the same document length. The document in fourth position, instead, despite having the same number of matches as the previous two, has a lower score because it’s much longer, so the document length normalisation penalises it. We can also notice how the last document has a very low score and it’s probably irrelevant.

Precision on multi-term queries can be controlled by specifying an arbitrary threshold for the number of terms which should be matched. For example, we can re-write the query as:

{
    "query": {
        "match": {
            "content": {
                "query": "quick brown dog",
                "minimum_should_match": 75%
            }
        }
     }
}

The output will be basically the same, with the exception of having only the top four documents. This is because the fifth document, “This is a brown fox”, only matches 1/3 of the query terms, which is below 75%. You can experiment with different thresholds for minimum match, keeping in mind that there is a balance to find between removing irrelevant documents and not losing the relevant ones.

The Case of Phrase Matching

In the previous example, the query terms were completely unrelated, so the query “quick brown dog” also retrieved brown foxes and non-quick dogs. What if we need an exact match of the query? More precisely, what if we need to match all the query terms in their relative position? This is the case for named entities like “New York”, where the two terms individually don’t convey the same meaning as the two of them concatenated in this order.

Elasticsearch has an option for this: match_phrase. The previous query can be rewritten as:

{
    "query": {
        "match_phrase": {
            "content": "quick brown dog"
        }
    }
}

We immediately see that the query returns an empty result set: there is no document about quick brown dogs. Let’s re-write the query in a less restrictive way, dropping the “quick” term:

{
    "query": {
        "match_phrase": {
            "content": "brown dog"
        }
    }
}

Now we can see how the query retrieves only one document, precisely document 2, the only one to match the exact phrase.

Proximity Search

Sometimes a phrase match can be too restrictive. What if we’re not really interested in a precise match, but we’d rather retrieve documents where the query terms occur reasonably close to each other? This is an example of proximity search: the order of the terms doesn’t really matter, as long as they occur within the same context. This concept is less restrictive than a pure phrase match, but still stronger than a general purpose query.

In order to achieve proximity search, we simply need to define the search window, i.e. how far apart we allow the terms to be. This is called slop in Elasticsearch/Lucene terminology. The change to the previous query is minimal; for example, for a slop/window of 3 terms:

{
    "query": {
        "match_phrase": {
            "content": {
                "query": "brown dog",
                "slop": 3
            }
        }
    }
}

The result of the query:

Pos  Doc ID  Content                                                Score
1    2       This is a brown dog                                    0.9547657
2    4       The dog is brown but this document is very very long   0.2727902

We immediately see that the second document is also relevant to the query, but it was missed by the original phrase match. We can also try with a bigger slop, e.g.:

{
    "query": {
        "match_phrase": {
            "content": {
                "query": "brown dog",
                "slop": 4
            }
        }
    }
}

which retrieves the following:

Pos  Doc ID  Content                                                Score
1    2       This is a brown dog                                    0.9547657
2    3       This dog is really brown                               0.4269842
3    4       The dog is brown but this document is very very long   0.2727902

The new result, document 3, is also still relevant. On the other hand, if we keep increasing the slop, at some point we will end up including non-relevant, or less relevant, results. It’s hence important to understand the needs of the specific scenario and find a balance between not missing relevant results and not including non-relevant ones.

Within-Sentence Proximity Search

A variation of the proximity search discussed above consists in the need to match terms occurring in a specific context. Such a context could be the same sentence, the same paragraph, the same section, etc. The difference with what we already discussed in the previous paragraph is that here we might have a specific structure (sections, sentences, …) but not a specific window/slop size in mind.

Let’s assume the “content” field of our documents is a list of sentences, so we want to perform proximity search within a sentence. An example of a document with two sentences:

{
    "content": ["This is a brown fox", "This is white dog"]
}

The trick to allow within-sentence search is to define a position offset gap which is big enough to cover the sentence length, and then to use the same value as the slop:

{
    "properties": {
        "content": {
            "type": "string",
            "position_offset_gap": 100
        }
    }
}

Here the value 100 is arbitrary and is “big enough”. Pushing this configuration to our index, we force the terms to jump 100 positions ahead when there is a new sentence. In the previous document, if the term “fox” is in position 5, the following term “This” will be in position 106 rather than 6, because it’s in a new sentence. You can download the full script to implement sentence-based proximity search, with the updated documents to reflect the sentence structure, keeping in mind that applying this option to an existing data set requires re-indexing.

The value of the position offset can now be used as slop value:

{
    "query": {
        "match_phrase": {
            "content": {
                "query": "brown dog",
                "slop": 100
            }
        }
    }
}

This query will return four documents. More precisely, document 1, which mentions a white dog and a brown fox, will not be retrieved, because the two terms appear in different sentences.

Summary

We have explored some options from Elasticsearch to improve the results for queries with multiple terms. We have seen how Elasticsearch provides this functionality in a fairly easy way. The starting point is to understand the specific use case that we’re trying to tackle, and from there we have a set of choices. Depending on the scenario, we might want to choose one of the following:

  • a simple match search
  • a match search with a minimum match ratio
  • a phrase-based match
  • a phrase match with a slop for proximity search
  • a phrase match with a slop which matches the position offset specified in the index, for sentence-based (or any other context-based) proximity search

How to Query Elasticsearch with Python

Elasticsearch is an open-source distributed search server built on top of Apache Lucene. It’s a great tool that allows you to quickly build applications with full-text search capabilities. The core implementation is in Java, but it provides a nice REST interface which allows you to interact with Elasticsearch from any programming language.

This article provides an overview on how to query Elasticsearch from Python. There are two main options:

  • Implement the REST-API calls to Elasticsearch
  • Use one of the Python libraries that does the above for you

Quick Intro on Elasticsearch

Elasticsearch is developed in Java on top of Lucene, but the format for configuring the index and querying the server is JSON. Once the server is running, by default it’s accessible at localhost:9200 and we can start sending our commands via e.g. curl:

curl -XPOST http://localhost:9200/test/articles/1 -d '{
    "content": "The quick brown fox"
}'

This command creates a new document, and since the index didn’t exist, it also creates the index. Specifically, the format for the URL is:

http://hostname:port/index_name/doc_type/doc_id

so we have just created an index “test” which contains documents of type “articles”. The document has only one field, “content”. Since we didn’t specify otherwise, the content is indexed using the default Lucene analyzer (which is usually a good choice for standard English). The document id is optional, and if we don’t explicitly provide one, the server will create a random hash-like one.

We can insert a few more documents; see for example the file create_index.sh from the code snippets on github.

Once the documents are indexed, we can perform a simple search, e.g.:

curl -XPOST http://localhost:9200/test/articles/_search?pretty=true -d '{
    "query": {
        "match": {
            "content": "dog"
        }
    }
}'

Using the sample documents above, this query should return only one document. Performing the same query over the term “fox” rather than “dog” should instead give four documents, ranked according to their relevance.

How the Elasticsearch/Lucene ranking function works, and all the countless configuration options for Elasticsearch, are not the focus of this article, so bear with me if we’re not digging into the details. For the moment, we’ll just focus on how to integrate/query Elasticsearch from our Python application.

Querying Elasticsearch via REST in Python

One of the options for querying Elasticsearch from Python is to create the REST calls for the search API and process the results afterwards. The requests library is particularly easy to use for this purpose. We can install it with:

pip install requests

The sample query used in the previous section can be easily embedded in a function:

import json
import requests

def search(uri, term):
    """Simple Elasticsearch Query"""
    query = json.dumps({
        "query": {
            "match": {
                "content": term
            }
        }
    })
    response = requests.get(uri, data=query)
    results = json.loads(response.text)
    return results

The “results” variable will be a dictionary loaded from the JSON response. We can pretty-print the JSON, to observe the full output and understand all the information it provides, but again this is beyond the scope of this post. So we can simply print the results nicely, one document per line, as follows:

def format_results(results):
    """Print results nicely:
    doc_id) content
    """
    data = [doc for doc in results['hits']['hits']]
    for doc in data:
        print("%s) %s" % (doc['_id'], doc['_source']['content']))

Similarly, we can create new documents:

def create_doc(uri, doc_data={}):
    """Create new document."""
    query = json.dumps(doc_data)
    response = requests.post(uri, data=query)
    print(response)

with the doc_data variable being a (Python) dictionary which resembles the structure of the document we’re creating.
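
Putting the helpers together, a minimal usage sketch (assuming Elasticsearch is running locally and the test index from the earlier curl examples exists):

if __name__ == '__main__':
    search_uri = 'http://localhost:9200/test/articles/_search'
    create_uri = 'http://localhost:9200/test/articles'

    # add one more document, then search for it
    # (newly indexed documents may take about a second to become searchable)
    create_doc(create_uri, {"content": "The fastest dog in town"})
    results = search(search_uri, 'dog')
    format_results(results)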

You can see a full working toy example in the rest.py file in the Gist on github.

Querying Elasticsearch Using elasticsearch-py

The requests library is fairly easy to use, but there are several options in terms of libraries that abstract away the concepts related to the REST API and focus on Elasticsearch concepts. In particular, the official Python client for Elasticsearch, called elasticsearch-py, can be installed with:

pip install elasticsearch

It’s fairly low-level compared to other client libraries with similar capabilities, but it provides a consistent and easy-to-extend API.

We can replicate the search used with the requests library, as well as the result print-out, just using a few lines of Python:

from elasticsearch import Elasticsearch

es = Elasticsearch()
res = es.search(index="test", doc_type="articles", body={"query": {"match": {"content": "fox"}}})
print("%d documents found" % res['hits']['total'])
for doc in res['hits']['hits']:
    print("%s) %s" % (doc['_id'], doc['_source']['content']))

In a similar fashion, we can re-create the functionality of adding an extra document:

es.create(index="test", doc_type="articles", body={"content": "One more fox"})

The full functionality of this client library is well described in the documentation.

Summary

This article has briefly discussed a couple of options to integrate Elasticsearch into a Python application. The key points of the discussion are:

  • We can interact with Elasticsearch using the REST API
  • The requests library is particularly useful for this purpose, and probably much cleaner and easier to use than the urllib module (part of the standard library)
  • Many other Python libraries implement an Elasticsearch client, abstracting away the concepts related to the REST API and focusing on Elasticsearch concepts
  • We have seen simple examples with elasticsearch-py

The full code for the examples is available as usual in a Gist:
https://gist.github.com/bonzanini/fe2ff32116f16e3009be

Searching PubMed with Python

Update 2021-01: minor update to reflect some changes in the Pubmed API

PubMed is a search engine providing access to millions of biomedical citations. Users can freely search for biomedical references. For some articles, access to the full-text paper is also open.

This post describes how you can programmatically search the PubMed database with Python, in order to integrate searching or browsing capabilities into your Python application.

There are two main options to consider:

  • Accessing the database via their public API
  • Using a package that does the above for you, e.g. Biopython

The Entrez Database a.k.a. the PubMed API

The PubMed API is called the Entrez Database. It’s a freely accessible web service, although there are some guidelines to follow (at the time of writing, they recommend not posting more than three requests per second).

There are in total 8 different functions, or e-utilities, which access the database in different ways. Most of the utilities will return XML data, although some of them have the option to return a more convenient JSON format.

In particular, the search API is available at the following URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi

If we want to search for the term fever, the URL we need is for example:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=20&sort=relevance&term=fever

The query string parameters used in this example:

  • db=pubmed, to narrow the search down to the pubmed DB only
  • retmode=json, to have a JSON string in response and not an XML
  • retmax=20, to obtain 20 results
  • sort=relevance, so the results are sorted by relevance and not by date added, which is the default ranking option on PubMed
  • term=[your query], the URL-encoded query

This search session will provide a number of PubMed IDs (probably 20) corresponding to the top citations which match our query.
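
A minimal sketch of the same esearch call from Python using requests (the JSON layout below, with the esearchresult / idlist keys, is what the service returns at the time of writing):

import requests

base_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils'
params = {
    'db': 'pubmed',
    'retmode': 'json',
    'retmax': 20,
    'sort': 'relevance',
    'term': 'fever',
}
resp = requests.get(base_url + '/esearch.fcgi', params=params)
id_list = resp.json()['esearchresult']['idlist']
print(id_list)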

In order to get some more details about these citations, we can use the efetch utility, which takes one or more citation IDs as input. At the moment, the efetch utility does not return JSON, so XML is the only option to consider.

Given a list of citation IDs, the fetch operation can be built as follows:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=ID1,ID2,...

At this point, the response will be an XML to handle with e.g. minidom or another XML library. Please notice that we can query the efetch utility for multiple documents, simply by separating their IDs with commas.

Overall, it’s relatively easy to create the appropriate request using libraries like urllib.request or, better, requests. The response can be parsed with the json module, or minidom in case of XML.
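
Continuing the sketch above (and reusing its id_list), a quick example of an efetch call parsed with minidom:

import requests
from xml.dom import minidom

base_url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils'
resp = requests.get(base_url + '/efetch.fcgi',
                    params={'db': 'pubmed', 'retmode': 'xml', 'id': ','.join(id_list)})
doc = minidom.parseString(resp.content)
for node in doc.getElementsByTagName('ArticleTitle'):
    # ArticleTitle may contain nested markup; print only its plain-text parts
    print(''.join(child.data for child in node.childNodes
                  if child.nodeType == child.TEXT_NODE))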

An even more convenient way to do the job is to use an existing library that does what we need for us. A good example is Biopython, a comprehensive package for biological computation in Python.

Searching PubMed with Biopython

You can install the Biopython package with pip:

pip install biopython

The only component we need for searching PubMed is Entrez, which we can import with:

from Bio import Entrez

We can define a function for performing the search, e.g.

def search(query):
    Entrez.email = 'your.email@example.com'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='20',
                            retmode='xml',
                            term=query)
    results = Entrez.read(handle)
    return results

The list of citation IDs will be available as results['IdList'].

The next step is to fetch the details for all the retrieved articles via the efetch utility:

def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'your.email@example.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results


A full example of search over the term fever:

if __name__ == '__main__':
    results = search('fever')
    id_list = results['IdList']
    papers = fetch_details(id_list)
    for i, paper in enumerate(papers['PubmedArticle']):
         print("{}) {}".format(i+1, paper['MedlineCitation']['Article']['ArticleTitle']))


Notice that the structure of the MedlineCitation dictionaries can get really convoluted, so you can get familiar with it by doing some pretty-printing. For example, after fetching the papers with the code above, you can print out the data for the first paper using the following snippet, so you can understand the structure of its record.

# Pretty print the first paper in full to observe its structure
import json
print(json.dumps(papers['PubmedArticle'][0], indent=2))

The reason for declaring your email address is to allow the NCBI to contact you before blocking your IP, in case you’re violating the guidelines.

The Gist of the full example:

https://gist.github.com/bonzanini/5a4c39e4c02502a8451d