June 2015 – Marco Bonzanini

Tuning Relevance in Elasticsearch with Custom Boosting

Elasticsearch offers different options out of the box in terms of ranking function (similarity function, in Lucene terminology). The default ranking function is a variation of TF-IDF, relatively simple to understand and, thanks to some smart normalisations, also quite effective in practice.

Each use case is a different story so sometimes the default ranking function doesn’t works as well as it does for the general case. In particular, when the collection starts being fairly diverse in terms of document structure (different document types, with different fields, with different size, etc.), we need to adjust the default behaviour so that we can tune relevance.

Static Boosting

The first option we discuss is static boosting, or in other words boosting at indexing time. The boost factor is calculated depending on the nature of the document.

For example, if we have different types of documents in the index, some type could be given more importance because of a specific business rule, or just to compensate for different morphology, as short documents are promoted by the default ranking, so on the other side long documents are penalised.

A different example is the need to apply some sort of popularity/authority score, such as PageRank or similar, that takes into account how a document is linked to other documents.

Once we have a clear formula for the boost factor, we can store its value as an extra field in the document, and use a function_score query to incorporate it into the relevance score, e.g.:

{
    "query": {
        "function_score": {
            "query": {  
                "match": {
                    "some_field_name": "query terms here"
                }
            },
            "functions": [{
                "field_value_factor": { 
                    "field": "my_boost_field"
                }
            }],
            "score_mode": "multiply"
        }
    }
}

In this example, the boost factor is stored in my_boost_field: its value and the relevance score coming from the similarity function are multiplied to achieve the final score used for the ranking.

Because of the nature of this approach, if we want to give a different boosting to the document, the document must be re-indexed in order to apply. In other words, we shouldn’t use this approach if the business rules that define the boosting are likely to change, or in general if we know that over time the relative importance of a document will change (and its boost factor with it).

Dynamic boosting

A different approach consists in the possibility of boosting the documents at query time.

In a previous post – How to Promote Recent Articles in Elasticsearch – we discussed an example of dynamic boosting used to promote recently published documents using a so-called decay function. In this case, we used the value of a date field to calculate the relative importance of the documents.

When our query is a combination of different sub-queries, we can assign different weights to the different components, for example:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "some query terms here",
              "boost": 3
            }
          }
        },
        {
          "match": { 
            "content": "other query terms there"
          }
        }
      ]
    }
  }
}

This is a boolean OR query with two clauses, one over the field title and one over the field content. The clauses can be any type of query, and the boosting factor set to 3 means the the first clause is three times more important than the second clause.

It is worth noticing that because of the length normalisation, short fields are implicitely boosted by the default similarity, and the title is usually shorter than the whole content (similar consideration for subject vs body of a message, etc.).

When the query is a multi_match one, we can plug the boost factor in-line, for example:

{
  "multi_match" : {
    "query": "some query terms here", 
    "fields": [ "title^3", "content" ] 
  }
}

In this case, the same match query is execture over the two different fields, and using the caret syntax we can specify, like in the previous example, that the title is three times more important than the content.

Sensible values for this kind of boost factors are usually in the range 1-10, or 1-15 max, as the Elasticsearch manual suggests. This is because some boost normalisation happens internally, so bigger values will not have much impact.

Final Thoughts

Improving relevance is a crucial part of a search application, and often the final tuning is a matter of try-and-see. Boosting – in particular dynamic boosting – allows some level of control over relevance tuning, in a way that is easy to experiment with, and easy to change in order to adapt for specific business rules. With dynamic boosting, there is no need to re-index, so this is an interesting approach for relevance tuning.

Elasticsearch offers a boosting option for nearly every type of query, and on top of this, a function_score can also be employed to further personalise the scoring function.

@MarcoBonzanini

Mining Twitter Data with Python (and JS) – Part 7: Geolocation and Interactive Maps

Geolocation is the process of identifying the geographic location of an object such as a mobile phone or a computer. Twitter allows its users to provide their location when they publish a tweet, in the form of latitude and longitude coordinates. With this information, we are ready to create some nice visualisation for our data, in the form of interactive maps.

This article briefly introduces the GeoJSON format and Leaflet.js, a nice Javascript library for interactive maps, and discusses its integration with the Twitter data we have collected in the previous parts of this tutorial (see Part 4 for details on the rugby data set).

Tutorial Table of Contents:

Part 1: Collecting data
Part 2: Text Pre-processing
Part 3: Term Frequencies
Part 4: Rugby and Term Co-Occurrences
Part 5: Data Visualisation Basics
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps (this article)

GeoJSON

GeoJSON is a format for encoding geographic data structures. The format supports a variety of geometric types that can be used to visualise the desired shapes onto a map. For our examples, we just need the simplest structure, a Point. A point is identified by its coordinates (latitude and longitude).

In GeoJSON, we can also represent objects such as a Feature or a FeatureCollection. The first one is basically a geometry with additional properties, while the second one is a list of features.

Our Twitter data set can be represented in GeoJSON as a FeatureCollection, where each tweet would be an individual Feature with its one geometry (the aforementioned Point).

This is how the JSON structure looks like:

{
    "type": "FeatureCollection",
    "features": [
        { 
            "type": "Feature",
            "geometry": {
                "type": "Point", 
                "coordinates": [some_latitude, some_longitude]
            },
            "properties": {
                "text": "This is sample a tweet",
                "created_at": "Sat Mar 21 12:30:00 +0000 2015"
            }
        },
        /* more tweets ... */
    ]
}

From Tweets to GeoJSON

Assuming the data are stored in a single file as described in the first chapter of this tutorial, we simply need to iterate all the tweets looking for the coordinates field, which may or may not be present. Keep in mind that you need to use coordinates, because the geo field is deprecated (see the API).

This code will read the data set, looking for tweets where the coordinates are explicitely given. Once the GeoJSON data structure is created (in the form of a Python dictionary), then the data are dumped into a file called geo_data.json:

# Tweets are stored in "fname"
with open(fname, 'r') as f:
    geo_data = {
        "type": "FeatureCollection",
        "features": []
    }
    for line in f:
        tweet = json.loads(line)
        if tweet['coordinates']:
            geo_json_feature = {
                "type": "Feature",
                "geometry": tweet['coordinates'],
                "properties": {
                    "text": tweet['text'],
                    "created_at": tweet['created_at']
                }
            }
            geo_data['features'].append(geo_json_feature)

# Save geo data
with open('geo_data.json', 'w') as fout:
    fout.write(json.dumps(geo_data, indent=4))

Interactive Maps with Leaflet.js

Leaflet.js is an open-source Javascript library for interactive maps. You can create maps with tiles of your choice (e.g. from OpenStreetMap or MapBox), and overlap interactive components.

In order to prepare a web page that will host a map, you simply need to include the library and its CSS, by putting in the head section of your document the following lines:

<link rel="stylesheet" href="http://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.3/leaflet.css" />
<script src="http://cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.3/leaflet.js"></script>

Moreover, we have all our GeoJSON data in a separate file, so we want to load the data dynamically rather than manually put all the points in the map. For this purpose, we can easily play with jQuery, which we also need to include:

<script src="http://code.jquery.com/jquery-2.1.0.min.js"></script>

The map itself will be placed into a div element:

<!-- this goes in the <head> -->
<style>
#map {
    height: 600px;
}
</style>
<!-- this goes in the <body> -->
<div id="map"></div>

We’re now ready to create the map with Leaflet:

// Load the tile images from OpenStreetMap
var mytiles = L.tileLayer('http://{s}.tile.osm.org/{z}/{x}/{y}.png', {
    attribution: '&copy; <a href="http://osm.org/copyright">OpenStreetMap</a> contributors'
});
// Initialise an empty map
var map = L.map('map');
// Read the GeoJSON data with jQuery, and create a circleMarker element for each tweet
// Each tweet will be represented by a nice red dot
$.getJSON("./geo_data.json", function(data) {
    var myStyle = {
        radius: 2,
        fillColor: "red",
        color: "red",
        weight: 1,
        opacity: 1,
        fillOpacity: 1
    };

    var geojson = L.geoJson(data, {
        pointToLayer: function (feature, latlng) {
            return L.circleMarker(latlng, myStyle);
        }
    });
    geojson.addTo(map)
});
// Add the tiles to the map, and initialise the view in the middle of Europe
map.addLayer(mytiles).setView([50.5, 5.0], 5);

A screenshot of the results:

The above example uses OpenStreetMap for the tile images, but Leaflet lets you choose other services. For example, in the following screenshot the tiles are coming from MapBox.

You can see the interactive maps in action here:

Tiles from OpenStreetMap

Tiles from MapBox

Summary

In general there are many options for data visualisation in Python, but in terms of browser-based interaction, Javascript is also an interesting option, and the two languages can play well together. This article has shown that building a simple interactive map is a fairly straightforward process.

With a few lines of Python, we’ve been able to transform our data into a common format (GeoJSON) that can be passed onto Javascript for visualisation. Leaflet.js is a nice Javascript library that, almost out of the box, lets us create some nice interactive maps.

Tutorial Table of Contents:

Part 1: Collecting data
Part 2: Text Pre-processing
Part 3: Term Frequencies
Part 4: Rugby and Term Co-Occurrences
Part 5: Data Visualisation Basics
Part 6: Sentiment Analysis Basics
Part 7: Geolocation and Interactive Maps (this article)

@MarcoBonzanini

Functional Programming in Python

This is probably not the newest of the topics, but I haven’t had the chance to dig into it before, so here we go.

Python supports multiple programming paradigms, but it’s not best known for its Functional Programming style.
As its own creator has mentioned before, Python hasn’t been heavily influenced by other functional languages, but it does show some functional features. This is a very gentle introduction to Functional Programming in Python.

What is Functional Programming anyway?

Functional Programming is a programming paradigm based on the evaluation of expression, which avoids changing-state and mutable data. Popular functional languages include Lisp (and family), ML (and family), Haskell, Erlang and Clojure.

Functional Programming can be seen as a subset of Declarative Programming (as opposed to Imperative Programming).

Some of the characteristics of Functional Programming are:

Avoid state representation
Data are immutable
First class functions
High-order functions
Recursion

One of the key motivations beyond Functional Programming, and beyond some of its aspects like no changing-state and no mutable data, is the need to eliminate side effects, i.e. changes in the state that don’t really depend on the function input, making the program more difficult to understand and predict.

State representation and Transformation

In functional programming, functions transform input into output, without an intermediate representation of the current state.

Consider the following code:

def square(x):
    return x*x

input = [1, 2, 3, 4]
output = []
for v in input:
    output.append(square(v))

The code takes a list as input and iterates through its individual values applying the function square() to them; the final outcome is stored in the variable output which is initially an empty list, and is updated during the iteration.

Now, let’s consider the following functional style:

def square(x):
    return x*x

input = [1, 2, 3, 4]
output = map(square, input)

The logic is very similar: apply the function square() to all the element of the list given as input, but there is no internal state representation. The output is also a little bit different, as the map() function will return an iterator rather than a list.

Data are immutable

Related to the point above, the concept of immutable data can be summarised as: functions will return new data as the outcome of a computation, leaving the original input intact.

As mentioned earlier, Python is not a purely functional language, so there are both mutable and immutable data structures, e.g. lists are mutable arrays, while tuples are immutable:

>>> mutable = [1, 2, 3, 4]
>>> mutable[0] = 100
>>> mutable
[100, 2, 3, 4]
>>> mutable.append('hello')
>>> mutable
[100, 2, 3, 4, 'hello']

>>> immutable = (1, 2, 3, 4)
>>> immutable[0] = 100
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment
>>> immutable.append('hello')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'append'
>>> immutable
(1, 2, 3, 4)

First class functions and High-order functions

Functions being first class means that they can appear anywhere in a program, including return values and arguments of other functions. For example:

def greet(name):
    print("Hello %s!" % name)

say_hello = greet
say_hello('World')
# Hello World!

High-order functions are functions which can take functions as arguments (or return them as results). For example, this is what the map() function does, as one of the arguments is the name of the function to apply to a sequence. Notice the difference between the following two lines:

# some_func is an input of other_func()
out = other_func(some_func, some_data)
# the output of some_func() is the input of other_func()
out = other_func(some_func(some_data))

The first case represents a high-order function.

Recursion

“To understand recursion, you must understand recursion.”

A recursive function calls itself on a smaller input, until the problem is reduced to some base case. For example, the factorial can be calculated in Python as:

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)

Recursion is also a way to achieve iteration, usually in a more compact and elegant way, again without using internal state representation and without modifying the input data.

While it is possible to write recursive functions in Python, its interpreter doesn’t recognise tail recursion, an optimisation technique typically used by functional languages.

What else?

Besides the already mentioned map() function, there is more Functional Programming support in the Python standard library.

reduce() and filter()

The functions reduce() and filter() naturally belong together with the map() function. While map will apply a function over a sequence, producing a sequence as output, reduce will apply a function of two arguments cumulatively to the items in a sequence, in order to reduce the sequence to a single value.

In Python 3, reduce() is not a core built-infunction anymore and it has been moved to functools (still part of the standard library), i.e. in order to use it you need to import it as follows:

from functools import reduce

The use of filter() consists in applying a function to each item in a sequence, building a new sequence for those items that return true. For example:

def is_odd(n):
   return n%2 == 1
 
print(is_odd(2))
# False
print(is_odd(3))
# True
values = [1, 2, 3, 4, 5, 6]
result = filter(is_odd, values)
print(list(result))
# [1, 3, 5]

Notice: the term sequence has to be intended as iterable, and for both map() and filter() the return type is a generator (hence the use of list() to visualise the actual list).

lambda functions

Python supports the definition of lambda functions, i.e. small anonymous functions not necessarily bound to a name, at runtime. The use of anonymous functions is fairly common in Functional Programming, and it’s easy to see their use related to map/reduce/filter:

words = "The quick brown fox".split()
words_len = map(lambda w: len(w), words)
print(words)
# ['The', 'quick', 'brown', 'fox']
print(list(words_len))
# [3, 5, 5, 3]

Notice: the short lambda function above is just an example, in this case we could use len() directly.

itertools

The itertools module groups a number of functions related to iterators and their use/combination. Most of the basic and not-so-basic functionality that you might need to deal with iterators are already implemented there. The official Python documentation has a nice tutorial on Functional Programming, where the itertools module is described in details.

Generators for lazy-evaluation

Generator functions and generator expressions in Python provide objects that behave like an iterator (i.e. they can be looped over), without having the full content of the iterable loaded in memory. This concept is linked to lazy-evaluation, i.e. a call-by-need strategy where an item in a sequence is evaluated only when it’s needed, otherwise it’s not loaded in memory. This has repercussions on the amount of memory needed to deal with large sequences.

Some Final Thoughts

The topic of Functional Programming deserves definitely more than a blog post.

While Functional Programming has its benefits, Python is not a pure functional language. In particular, the mix of mutable and immutable data structures makes it difficult to truly separate the functional aspects of the language from the non-functional ones. The lack of some optimisations (e.g. no tail recursion) also makes some aspects of Functional Programming particularly costly, raising some questions about efficiency.

Even though Python is not openly inspired by other functional language, I think it’s always important to have an open mind regarding different programming paradigms. From this point of view, knowing what Python can (and can’t) do and how can only be beneficial for any Python user.

Please leave a comment or share if you liked (or disliked!) the article.

@MarcoBonzanini