Phrase Match and Proximity Search in Elasticsearch

The case of multi-term queries in Elasticsearch offers some room for discussion, because there are several options to consider depending on the specific use case we’re dealing with.

Multi-term queries are, in their most generic definition, queries with several terms. These terms could be completely unrelated, or they could be about the same topic, or they could even be part of a single specific concept. All these scenarios call for different configurations. This article discusses some of the options available in Elasticsearch.

Sample Documents

Assuming Elasticsearch is up and running on your local machine, you can download the script which creates the data set used in the following examples. These are the documents it creates:

Doc ID | Content
1 | This is a brown fox
2 | This is a brown dog
3 | This dog is really brown
4 | The dog is brown but this document is very very long
5 | There is also a white cat
6 | The quick brown fox jumps over the lazy dog

Notice that, for the sake of these examples, we’re using the default configuration. This means using Lucene’s default TF-IDF scoring function, which includes some score normalisation based on document length (shorter documents are promoted).

In order to run all the following queries, the basic option is to use curl, e.g.:

curl -XPOST 'http://localhost:9200/test/articles/_search?pretty=true' -H 'Content-Type: application/json' -d '{THE QUERY CODE HERE}'

although one could embed the query in Python code, as discussed in a previous post (or in any programming language that can make REST calls). With the pretty=true parameter, the JSON output will be more readable on the shell.
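As a minimal sketch of the Python option, the snippet below wraps a query in an HTTP request using only the standard library. The function names build_search_request and run_search are our own for illustration; the URL is the one from the curl example and assumes a local instance with the test index. The Content-Type header is required by recent Elasticsearch versions and should be harmless on older ones.

```python
import json
import urllib.request

# Assumed endpoint, copied from the curl example above.
SEARCH_URL = "http://localhost:9200/test/articles/_search?pretty=true"


def build_search_request(query_body, url=SEARCH_URL):
    """Wrap a query dictionary in a POST request without sending it."""
    data = json.dumps(query_body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(query_body, url=SEARCH_URL):
    """Send the query and return the decoded JSON response."""
    request = build_search_request(query_body, url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With Elasticsearch running, run_search({"query": {"match": {"content": {"query": "quick brown dog"}}}}) should return the parsed response, whose matching documents are listed under response["hits"]["hits"].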

A General Purpose Query

The first example of query is for the simple case of terms which may or may not be related. In this scenario, we decide to use a classic match query. This kind of query does not impose any restriction between multiple terms, but of course will promote documents which contain more query terms.

{
    "query": {
        "match": {
            "content": {
                "query": "quick brown dog"
            }
        }
    }
}

This query retrieves 5 documents, in this order:

Pos | Doc ID | Content | Score
1 | 6 | The quick brown fox jumps over the lazy dog | 0.81502354
2 | 2 | This is a brown dog | 0.26816052
3 | 3 | This dog is really brown | 0.26816052
4 | 4 | The dog is brown but this document is very very long | 0.15323459
5 | 1 | This is a brown fox | 0.055916067

We can notice how the first document has a much higher score, because it’s the only one containing all the query terms. The documents in positions 2 and 3 share the same score, because they have the same number of matches (two terms) and the same document length. The document in fourth position, despite having the same number of matches as the previous two, has a lower score because it’s much longer, so the document length normalisation penalises it. We can also notice how the last document has a very low score and is probably irrelevant.

Precision on multi-term queries can be controlled by specifying a threshold for the number of terms which should be matched. For example, we can re-write the query as:

{
    "query": {
        "match": {
            "content": {
                "query": "quick brown dog",
                "minimum_should_match": "75%"
            }
        }
    }
}

The output will be basically the same, with the exception of having only the top four documents. With three query terms, 75% corresponds to 2.25 terms, which Elasticsearch rounds down to two required matches. The fifth document, “This is a brown fox”, only matches one of the three query terms, so it is dropped. You can experiment with different thresholds for minimum match, keeping in mind that there is a balance to find between removing irrelevant documents and not losing the relevant ones.
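To make the effect of the threshold concrete, here is a small pure-Python sketch that reproduces the filtering on the sample documents without contacting Elasticsearch. It is a simplified model, not how Elasticsearch is implemented: the tokenize helper assumes that lowercasing and splitting on whitespace is close enough to the default analyzer for this sample text, and the function names are our own.

```python
DOCS = {
    1: "This is a brown fox",
    2: "This is a brown dog",
    3: "This dog is really brown",
    4: "The dog is brown but this document is very very long",
    5: "There is also a white cat",
    6: "The quick brown fox jumps over the lazy dog",
}


def tokenize(text):
    # Assumption: a rough stand-in for the default analyzer.
    return text.lower().split()


def required_matches(num_terms, percent):
    """Number of terms that must match; the percentage is rounded down."""
    return (num_terms * percent) // 100


def matching_docs(query, percent=None):
    """IDs of documents matching at least the required number of terms."""
    terms = set(tokenize(query))
    # Without a threshold, a single matching term is enough.
    needed = 1 if percent is None else max(1, required_matches(len(terms), percent))
    result = []
    for doc_id, text in sorted(DOCS.items()):
        found = len(terms & set(tokenize(text)))
        if found >= needed:
            result.append(doc_id)
    return result
```

Here matching_docs("quick brown dog") keeps documents 1, 2, 3, 4 and 6, while matching_docs("quick brown dog", 75) requires two terms and drops document 1. The IDs are returned in ascending order, not by score.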

The Case of Phrase Matching

In the previous example, the query terms were completely unrelated, so the query “quick brown dog” also retrieved brown foxes and non-quick dogs. What if we need an exact match of the query? More precisely, what if we need to match all the query terms in their relative position? This is the case for named entities like “New York”, where the two terms individually don’t convey the same meaning as the two of them concatenated in this order.

Elasticsearch has an option for this: match_phrase. The previous query can be rewritten as:

{
    "query": {
        "match_phrase": {
            "content": "quick brown dog"
        }
    }
}

We immediately see that the query returns an empty result set: there is no document about quick brown dogs. Let’s re-write the query in a less restrictive way, dropping the “quick” term:

{
    "query": {
        "match_phrase": {
            "content": "brown dog"
        }
    }
}

Now we can see how the query retrieves only one document, precisely document 2, the only one to match the exact phrase.
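The behaviour of a phrase match can be sketched in a few lines of Python: a document matches only if the query terms appear as a contiguous run, in the same order. This is a simplified model for illustration, with a whitespace tokenizer standing in for the analyzer and function names of our own choosing.

```python
DOCS = {
    1: "This is a brown fox",
    2: "This is a brown dog",
    3: "This dog is really brown",
    4: "The dog is brown but this document is very very long",
    5: "There is also a white cat",
    6: "The quick brown fox jumps over the lazy dog",
}


def tokenize(text):
    # Assumption: a rough stand-in for the default analyzer.
    return text.lower().split()


def phrase_match(query, text):
    """True if the query terms occur consecutively and in order."""
    query_terms = tokenize(query)
    tokens = tokenize(text)
    size = len(query_terms)
    return any(
        tokens[start:start + size] == query_terms
        for start in range(len(tokens) - size + 1)
    )


def phrase_match_ids(query):
    """IDs of the sample documents containing the exact phrase."""
    return [doc_id for doc_id, text in sorted(DOCS.items()) if phrase_match(query, text)]
```

In this model, phrase_match_ids("quick brown dog") is empty and phrase_match_ids("brown dog") contains only document 2, in line with the results above.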

Proximity Search

Sometimes a phrase match can be too restrictive. What if we’re not really interested in a precise match, but we’d rather retrieve documents where the query terms occur close to each other? This is an example of proximity search: the order of the terms doesn’t really matter, as long as they occur within the same context. This concept is less restrictive than a pure phrase match, but still stronger than a general purpose query.

In order to achieve proximity search, we simply need to define the search window, i.e. how far apart we allow the terms to be. This is called slop in Elasticsearch/Lucene terminology: roughly, the number of position moves needed to turn the terms found in the document into the query phrase. The change to the previous code is minimal, for example for a slop of 3:

{
    "query": {
        "match_phrase": {
            "content": {
                "query": "brown dog",
                "slop": 3
            }
        }
    }
}

The result of the query:

Pos | Doc ID | Content | Score
1 | 2 | This is a brown dog | 0.9547657
2 | 4 | The dog is brown but this document is very very long | 0.2727902

We immediately see that document 4 is also relevant to the query, but it was missed by the original phrase match. We can also try a bigger slop, e.g.:

{
    "query": {
        "match_phrase": {
            "content": {
                "query": "brown dog",
                "slop": 4
            }
        }
    }
}

which retrieves the following:

Pos | Doc ID | Content | Score
1 | 2 | This is a brown dog | 0.9547657
2 | 3 | This dog is really brown | 0.4269842
3 | 4 | The dog is brown but this document is very very long | 0.2727902

The new result, document 3, is still relevant. On the other hand, if we keep increasing the slop, at some point we will end up including non-relevant, or less relevant, results. It’s hence important to understand the needs of the specific scenario and find a balance between not missing relevant results and including non-relevant results.
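To see why document 4 needs a slop of 3 and document 3 a slop of 4, we can model the slop as the spread between the document positions of the terms once each is shifted back by its offset in the query. The sketch below is a simplified model that reproduces the results above on the sample documents. Lucene's actual scorer is more involved (for instance when a term is repeated in the query), and the function names are our own.

```python
from itertools import product

DOCS = {
    1: "This is a brown fox",
    2: "This is a brown dog",
    3: "This dog is really brown",
    4: "The dog is brown but this document is very very long",
    5: "There is also a white cat",
    6: "The quick brown fox jumps over the lazy dog",
}


def tokenize(text):
    # Assumption: a rough stand-in for the default analyzer.
    return text.lower().split()


def min_slop(query, text):
    """Smallest slop for which this model matches, or None if a term is missing."""
    tokens = tokenize(text)
    shifted = []
    for offset, term in enumerate(tokenize(query)):
        # Document positions of the term, shifted back by its query offset.
        positions = [pos - offset for pos, tok in enumerate(tokens) if tok == term]
        if not positions:
            return None
        shifted.append(positions)
    # An exact phrase gives identical shifted positions, i.e. a spread of 0.
    return min(max(combo) - min(combo) for combo in product(*shifted))


def sloppy_match_ids(query, slop):
    """IDs of the sample documents matching the phrase within the given slop."""
    result = []
    for doc_id, text in sorted(DOCS.items()):
        needed = min_slop(query, text)
        if needed is not None and needed <= slop:
            result.append(doc_id)
    return result
```

For the query "brown dog", the model gives a minimum slop of 0 for document 2, 3 for document 4, 4 for document 3 and 5 for document 6. Notice that swapping two adjacent terms already costs 2, which is why "dog is brown" needs 3 rather than 2.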

Within-Sentence Proximity Search

A variation of the proximity search discussed above is matching terms that occur in a specific context. Such a context could be the same sentence, the same paragraph, the same section, etc. The difference from what we discussed in the previous section is that here we might have a specific structure (sections, sentences, …) but not a specific window/slop size in mind.

Let’s assume the “content” field of our documents is a list of sentences, so we want to perform proximity search within a sentence. An example of a document with two sentences:

{
    "content": ["This is a brown fox", "This is white dog"]
}

The trick to allow within-sentence search is to define a slop which is big enough to capture the sentence length, and to use it as a position offset:

{
    "properties": {
        "content": {
            "type": "string",
            "position_offset_gap": 100
        }
    }
}

Here the value 100 is arbitrary and is “big enough”. Pushing this configuration to our index, we force the terms to jump 100 positions ahead when there is a new sentence. In the previous document, if the term “fox” is in position 5, the following term “This” will be in position 106 rather than 6, because it’s in a new sentence. You can download the full script to implement sentence-based proximity search, with the updated documents to reflect the sentence structure, keeping in mind that applying this option to an existing data set requires re-indexing.
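The position jump can be illustrated with a short sketch that assigns positions to the terms of a multi-valued field. The function name is our own, and positions start from 1 only to follow the numbering used in the text (Lucene itself counts from 0). Note also that newer Elasticsearch versions call this setting position_increment_gap, so the mapping above may need adapting to your version.

```python
def positions_with_gap(values, gap=100, first_position=1):
    """Return (term, position) pairs for a list of sentences.

    Assumes every sentence contains at least one term.
    """
    result = []
    position = first_position - 1  # position of the last term seen so far
    for index, sentence in enumerate(values):
        if index > 0:
            # Jump ahead by the gap whenever a new sentence starts.
            position += gap
        for term in sentence.split():
            position += 1
            result.append((term, position))
    return result
```

For the example document, positions_with_gap(["This is a brown fox", "This is white dog"]) places “fox” at position 5 and the second “This” at position 106.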

The value of the position offset can now be used as slop value:

{
    "query": {
        "match_phrase": {
            "content": {
                "query": "brown dog",
                "slop": 100
            }
        }
    }
}

This query will return four documents. In particular, document 1, which mentions a white dog and a brown fox, will not be retrieved, because the two terms appear in different sentences.
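As a rough check of this behaviour, the sketch below combines the position gap with a simplified two-term slop test. It is a model under our own assumptions (whitespace tokenizer, hypothetical function names, slop measured as the position spread after shifting the second term back by one), not a reproduction of Lucene's scorer.

```python
def positions(values, gap=100):
    """Map each lowercased term to its positions across all sentences."""
    result = {}
    position = -1
    for index, sentence in enumerate(values):
        if index > 0:
            # New sentence: jump ahead by the gap.
            position += gap
        for term in sentence.lower().split():
            position += 1
            result.setdefault(term, []).append(position)
    return result


def within_slop(values, first, second, slop, gap=100):
    """Simplified sloppy phrase check for the two-term query "<first> <second>"."""
    index = positions(values, gap)
    return any(
        abs(p_first - (p_second - 1)) <= slop
        for p_first in index.get(first, [])
        for p_second in index.get(second, [])
    )
```

With the two-sentence example, within_slop(["This is a brown fox", "This is white dog"], "brown", "dog", 100) is False, because the terms end up 104 positions apart, while a single sentence such as ["This dog is really brown"] still matches. In this simplified model, the last term of one sentence and the first term of the next are exactly one gap apart, so it may be worth testing that boundary case on a real index before settling on a slop equal to the gap.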

Summary

We have explored some of the options Elasticsearch offers to improve the results of queries with multiple terms. We have seen how Elasticsearch provides this functionality in a fairly easy way. The starting point is to understand the specific use case that we’re trying to tackle; from there we have a set of choices. Depending on the scenario, we might want to choose one of:

  • a simple match search
  • a match search with a minimum match ratio
  • a phrase-based match
  • a phrase match with a slop for proximity search
  • a phrase match with a slop which matches the position offset specified in the index, for sentence-based (or any other context-based) proximity search

4 thoughts on “Phrase Match and Proximity Search in Elasticsearch”

  1. How to have fuzziness in phrase matching here..for example in the quick brown dog example How can i get the correct matching results if say the search query has spelling errors ?

  2. Hello,
    I have never used Elastic Search before, can you please help me out on this little problem:

    I have divided multiple sentences into clauses(like A,B,C…..Z).
    Main sentence:”This is a brown fox that runs very fast and it likes to eat rabbits.”
    Clauses:[“This is a brown fox”,”that runs very fast”,”it likes to eat rabbits”]

    Now I want to search “fox” and “rabbits” in these clauses keeping a proximity of maximum left and right phrases to 2.
    This phrase proximity value is to be variable. Can we implement this using the approach you described above or is there any other better approach that you came across?

    Regards
