Elasticsearch · Relevance · Search

Tuning Relevance in Elasticsearch with Custom Boosting

Elasticsearch offers different options out of the box in terms of ranking function (similarity function, in Lucene terminology). The default ranking function is a variation of TF-IDF, relatively simple to understand and, thanks to some smart normalisations, also quite effective in practice.

Each use case is a different story so sometimes the default ranking function doesn’t works as well as it does for the general case. In particular, when the collection starts being fairly diverse in terms of document structure (different document types, with different fields, with different size, etc.), we need to adjust the default behaviour so that we can tune relevance.

Static Boosting

The first option we discuss is static boosting, or in other words boosting at indexing time. The boost factor is calculated depending on the nature of the document.

For example, if we have different types of documents in the index, some type could be given more importance because of a specific business rule, or just to compensate for different morphology, as short documents are promoted by the default ranking, so on the other side long documents are penalised.

A different example is the need to apply some sort of popularity/authority score, such as PageRank or similar, that takes into account how a document is linked to other documents.

Once we have a clear formula for the boost factor, we can store its value as an extra field in the document, and use a function_score query to incorporate it into the relevance score, e.g.:

{
    "query": {
        "function_score": {
            "query": {  
                "match": {
                    "some_field_name": "query terms here"
                }
            },
            "functions": [{
                "field_value_factor": { 
                    "field": "my_boost_field"
                }
            }],
            "score_mode": "multiply"
        }
    }
}

In this example, the boost factor is stored in my_boost_field: its value and the relevance score coming from the similarity function are multiplied to achieve the final score used for the ranking.

Because of the nature of this approach, if we want to give a different boosting to the document, the document must be re-indexed in order to apply. In other words, we shouldn’t use this approach if the business rules that define the boosting are likely to change, or in general if we know that over time the relative importance of a document will change (and its boost factor with it).

Dynamic boosting

A different approach consists in the possibility of boosting the documents at query time.

In a previous post – How to Promote Recent Articles in Elasticsearch – we discussed an example of dynamic boosting used to promote recently published documents using a so-called decay function. In this case, we used the value of a date field to calculate the relative importance of the documents.

When our query is a combination of different sub-queries, we can assign different weights to the different components, for example:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "some query terms here",
              "boost": 3
            }
          }
        },
        {
          "match": { 
            "content": "other query terms there"
          }
        }
      ]
    }
  }
}

This is a boolean OR query with two clauses, one over the field title and one over the field content. The clauses can be any type of query, and the boosting factor set to 3 means the the first clause is three times more important than the second clause.

It is worth noticing that because of the length normalisation, short fields are implicitely boosted by the default similarity, and the title is usually shorter than the whole content (similar consideration for subject vs body of a message, etc.).

When the query is a multi_match one, we can plug the boost factor in-line, for example:

{
  "multi_match" : {
    "query": "some query terms here", 
    "fields": [ "title^3", "content" ] 
  }
}

In this case, the same match query is execture over the two different fields, and using the caret syntax we can specify, like in the previous example, that the title is three times more important than the content.

Sensible values for this kind of boost factors are usually in the range 1-10, or 1-15 max, as the Elasticsearch manual suggests. This is because some boost normalisation happens internally, so bigger values will not have much impact.

Final Thoughts

Improving relevance is a crucial part of a search application, and often the final tuning is a matter of try-and-see. Boosting – in particular dynamic boosting – allows some level of control over relevance tuning, in a way that is easy to experiment with, and easy to change in order to adapt for specific business rules. With dynamic boosting, there is no need to re-index, so this is an interesting approach for relevance tuning.

Elasticsearch offers a boosting option for nearly every type of query, and on top of this, a function_score can also be employed to further personalise the scoring function.

@MarcoBonzanini

2 thoughts on “Tuning Relevance in Elasticsearch with Custom Boosting

  1. Hi,
    Nice post. Do you have any reference about this paragraph ?
    “Sensible values for this kind of boost factors are usually in the range 1-10, or 1-15 max, as the Elasticsearch manual suggests. ”

    I have not been able to find anything about it.

    Thank you

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s