The default ranking options in Elasticsearch (and Lucene) are purely based on content. Sometimes, it is useful to mix other factors in, such as the location/distance of a restaurant, to promote venues close to the user’s location, or the publication date of an article, to promote recent articles.
This post discusses a simple solution which is already integrated in Elasticsearch and can be enabled at query time: a decay function.
Default Ranking in Elasticsearch
The default ranking function (called similarity in Lucene terminology) is a normalised version of TF-IDF (this applies to Elasticsearch releases before 5.0; later releases default to BM25). The normalisation is based on document length and promotes shorter documents over longer ones. Informally, the idea behind this approach is that longer documents tend to have higher term frequencies simply because they have more content, so the score is adjusted to avoid penalising short documents. In practice, this works well most of the time, but if this normalised TF-IDF is not what you need, there are plenty of other options, including BM25, Divergence From Randomness and Language Model. All these ranking functions are based on the content of a field, i.e. they are based on term statistics.
To showcase how the default ranking works, and how we can affect it later, I’ll use a sample database of articles I’ve published on this blog (available as a Gist). The Gist creates an index on a local instance of Elasticsearch, where each document has the fields title (string) and published (date).
Once the index is created, we can run a very basic query:
{ "query": { "match": { "title": "python" } } }
The query looks for the term python in the title field, and you should see that the higher-ranked documents are the ones with shorter titles. This is because the term python occurs only once in each title, so what makes the difference in terms of scoring is the document length normalisation.
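To make the effect of length normalisation concrete, here is a minimal Python sketch of a simplified TF-IDF score in the spirit of Lucene's classic similarity. It is an approximation, not the exact Lucene implementation: query normalisation, coordination factors, boosts and the lossy encoding of the length norm are all ignored, and the exact constants in the IDF formula vary between versions. The titles are made up for illustration and are not taken from the Gist.

```python
import math


def tf(freq):
    # Term frequency component: grows with the square root of the raw count.
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    # Inverse document frequency: rarer terms get a higher weight.
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    # Length normalisation: shorter fields get a higher weight.
    return 1.0 / math.sqrt(num_terms)


def score(term, doc_tokens, corpus):
    # Simplified single-term score: tf * idf^2 * length norm.
    freq = doc_tokens.count(term)
    if freq == 0:
        return 0.0
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    return tf(freq) * idf(len(corpus), doc_freq) ** 2 * length_norm(len(doc_tokens))


def tokenize(title):
    # Naive whitespace tokenizer, standing in for a real analyzer.
    return title.lower().split()


# Hypothetical titles, for illustration only.
titles = [
    "Getting started with Python and Elasticsearch",
    "Python tips",
    "Notes on search ranking",
]
corpus = [tokenize(t) for t in titles]

ranked = sorted(titles, key=lambda t: score("python", tokenize(t), corpus), reverse=True)
for title in ranked:
    print(round(score("python", tokenize(title), corpus), 3), title)
```

Both matching titles contain the term exactly once, so the TF and IDF components are identical and only the length norm separates them: the two-word title ends up above the six-word one.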
Function Score and Decay Functions
Elasticsearch provides extensive support for custom scoring via the query DSL, meaning that relevance can be tweaked at query time without re-indexing. Using a function_score query, we can adjust the default scoring function to include freshness (a.k.a. recency) as a component of relevance.
Specifically, a family of scoring functions called decay functions (Gaussian, exponential and linear, named gauss, exp and linear in the query DSL) is the key to bringing numeric fields, date fields and geo fields into the picture.
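The three decay shapes can be compared with a small Python sketch. The formulas follow the ones given in the Elasticsearch documentation for function_score, as far as I can tell; the parameter names scale, offset and decay mirror the query DSL, while the function names and the scale of 10 (in arbitrary distance units) are assumptions made for this illustration.

```python
import math


def _effective_distance(distance, offset):
    # Distances within `offset` of the origin are not penalised at all.
    return max(0.0, abs(distance) - offset)


def gauss_decay(distance, scale, offset=0.0, decay=0.5):
    # Bell-shaped curve: flat near the origin, then a steep drop.
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    d = _effective_distance(distance, offset)
    return math.exp(-d ** 2 / (2.0 * sigma_sq))


def exp_decay(distance, scale, offset=0.0, decay=0.5):
    # Exponential curve: drops quickly at first, then has a long tail.
    lam = math.log(decay) / scale
    return math.exp(lam * _effective_distance(distance, offset))


def linear_decay(distance, scale, offset=0.0, decay=0.5):
    # Straight line that reaches zero at scale / (1 - decay).
    s = scale / (1.0 - decay)
    return max((s - _effective_distance(distance, offset)) / s, 0.0)


SCALE = 10.0  # hypothetical scale, in arbitrary distance units

for distance in (0.0, 5.0, 10.0, 20.0):
    print(
        distance,
        round(gauss_decay(distance, SCALE), 4),
        round(exp_decay(distance, SCALE), 4),
        round(linear_decay(distance, SCALE), 4),
    )
```

All three curves return 1 at the origin and the decay value (0.5 here) at a distance equal to scale; they differ in how they behave before and after that point, which is what should guide the choice between them.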
As an example, let’s consider the following query:
{ "query": { "function_score": { "query": { "match": { "title": "python"} }, "gauss": { "published": { "scale": "8w" } } } } }
This query produces the same result set as the previous one, but the ranking will be very different. In particular, recent articles will be pushed to the top. For a date field the origin defaults to the current time, and the scale parameter governs how quickly the score drops when moving away from it. In this example we have used a scale of 8 weeks: with the default decay value of 0.5, an article published 8 weeks ago has its score multiplied by 0.5, and the multiplier keeps shrinking quickly for older articles.
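When the query is built programmatically, it can help to spell out the parameters that are otherwise left at their defaults. The following Python sketch assembles the same request body as a dictionary; the helper name recency_query is hypothetical, and the snippet only builds and prints the JSON, leaving the choice of HTTP client open. The decay and boost_mode values shown are, to my knowledge, the documented defaults, so the resulting query should behave like the one above.

```python
import json


def recency_query(term, scale="8w", decay=0.5, offset=None):
    # Hypothetical helper: builds a function_score query that combines
    # a match on the title field with a Gaussian decay on the published date.
    decay_params = {"scale": scale, "decay": decay}
    if offset is not None:
        # Optional grace period during which no decay is applied.
        decay_params["offset"] = offset
    return {
        "query": {
            "function_score": {
                "query": {"match": {"title": term}},
                # No origin is given, so the current time is used for dates.
                "gauss": {"published": decay_params},
                # Multiply the text relevance score by the decay factor.
                "boost_mode": "multiply",
            }
        }
    }


body = recency_query("python")
print(json.dumps(body, indent=2))
```

Passing an offset such as "1w" would leave articles from the last week unpenalised before the decay starts to apply.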
Summary
Content is the key aspect to consider in most ranking problems which involve textual data. Sometimes though, content is not the only aspect to consider. This article has showcased decay functions as a simple Elasticsearch option to influence ranking by including freshness/recency as a core component to be mixed with content. Since decay functions work on numeric fields, date fields and geo-points, they can be used to integrate location/distance, recency and other numerical aspects like popularity (e.g. number of votes) into the ranking function. This can be done at query time, without the need to re-index.