Fuzzy String Matching in Python

Fuzzy String Matching, also called Approximate String Matching, is the process of finding strings that approximately match a given pattern.
The closeness of a match is often measured in terms of edit distance, which is the number of primitive operations necessary to convert the string into an exact match.
Primitive operations are usually: insertion (to insert a new character at a given position), deletion (to delete a particular character) and substitution (to replace a character with a new one).
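As a concrete illustration (a minimal sketch, not part of any library discussed below), the Levenshtein distance between two strings can be computed with a classic dynamic programming approach:

def edit_distance(s1, s2):
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn s1 into s2."""
    # dp[i][j] = distance between s1[:i] and s2[:j]
    dp = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        dp[i][0] = i  # delete all characters of s1[:i]
    for j in range(len(s2) + 1):
        dp[0][j] = j  # insert all characters of s2[:j]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[len(s1)][len(s2)]

edit_distance("ACME Factory", "ACME Factory Inc.")
# 5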

Fuzzy String Matching can have different practical applications. Typical examples are spell-checking, text re-use detection (the politically correct way of calling plagiarism detection), spam filtering, as well as several applications in the bioinformatics domain, e.g. matching DNA sequences.

This article plays around with fuzzywuzzy, a Python library for Fuzzy String Matching.

Getting Started with Fuzzywuzzy

FuzzyWuzzy has been developed and open-sourced by SeatGeek, a service for finding sports and concert tickets. Their original use case, as discussed on their blog, was the problem posed by the many different ways of labelling the same event, adding or omitting location, dates, venue, etc. The same problem arises with other entities such as people or companies.

To install the library, you can use pip as usual:

pip install fuzzywuzzy

The main modules in FuzzyWuzzy are called fuzz, for string-to-string comparisons, and process, to compare a string with a list of strings.

Under the hood, FuzzyWuzzy uses difflib, which is part of the standard library, so there is nothing extra to install. We can, however, benefit from the better performance of python-Levenshtein for sequence matching, so let’s also install this library:

pip install python-Levenshtein

Examples of Usage

Firstly, let’s import the main modules:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In order to calculate a similarity score between two strings, we can use the methods ratio() or partial_ratio():

fuzz.ratio("ACME Factory", "ACME Factory Inc.")
# 83
fuzz.partial_ratio("ACME Factory", "ACME Factory Inc.")
# 100

We can see how the ratio() function is confused by the suffix “Inc.” used in company names, but really the two strings refer to the same entity. This is captured by the partial ratio.

More examples:

fuzz.ratio('Barack Obama', 'Barack H. Obama')
# 89
fuzz.partial_ratio('Barack Obama', 'Barack H. Obama')
# 75

fuzz.ratio('Barack H Obama', 'Barack H. Obama')
# 97
fuzz.partial_ratio('Barack H Obama', 'Barack H. Obama')
# 92

Here we observe the opposite behaviour: different variations of Barack Obama’s name produce a lower score for the partial ratio. Why is that? Probably because the extra token for the middle name sits right in the middle of the string. For this particular case, we can benefit from other functions that tokenise the string and treat it as a set or as a sequence of words:

fuzz.token_sort_ratio('Barack Obama', 'Barack H. Obama')
# 92
fuzz.token_set_ratio('Barack Obama', 'Barack H. Obama')
# 100

fuzz.token_sort_ratio('Barack H Obama', 'Barack H. Obama')
# 100
fuzz.token_set_ratio('Barack H Obama', 'Barack H. Obama')
# 100

The token_* functions split the string on whitespace, lowercase everything and get rid of non-alphanumeric characters, which means punctuation is ignored (as well as weird unicode symbols).
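As a rough illustration of this behaviour (a simplified sketch, not fuzzywuzzy’s actual implementation; the function name is made up), we can normalise the strings ourselves and then apply the plain ratio:

import re
from fuzzywuzzy import fuzz

def naive_token_sort_ratio(s1, s2):
    """Simplified sketch: lowercase, replace non-alphanumeric characters
    with spaces, sort the tokens, then compare with the plain ratio."""
    def normalise(s):
        tokens = re.sub(r'[^0-9a-zA-Z]', ' ', s.lower()).split()
        return ' '.join(sorted(tokens))
    return fuzz.ratio(normalise(s1), normalise(s2))

naive_token_sort_ratio('Barack H Obama', 'Barack H. Obama')
# 100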

In case we have a list of options and we want to find the closest match(es), we can use the process module:

query = 'Barack Obama'
choices = ['Barack H Obama', 'Barack H. Obama', 'B. Obama']
# Get a list of matches ordered by score, default limit to 5
process.extract(query, choices)
# [('Barack H Obama', 95), ('Barack H. Obama', 95), ('B. Obama', 85)]

# If we want only the top one
process.extractOne(query, choices)
# ('Barack H Obama', 95)
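Both functions accept optional parameters to narrow down the results (assuming a version of fuzzywuzzy that supports them): limit controls how many matches extract() returns, while score_cutoff makes extractOne() return None if no choice reaches the given score:

# Only the top 2 matches
process.extract(query, choices, limit=2)
# [('Barack H Obama', 95), ('Barack H. Obama', 95)]

# Only return the best match if it scores at least 90
process.extractOne(query, choices, score_cutoff=90)
# ('Barack H Obama', 95)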

Summary

This article has introduced Fuzzy String Matching, which is a well understood problem with some interesting practical applications.

Python has a very simple option to tackle the problem: the FuzzyWuzzy library, which is built on top of difflib (and python-Levenshtein for speed). It can take a while to figure out how to scope our string matching problem, but the easy interface of fuzzywuzzy should help speed up development.

Phrase Match and Proximity Search in Elasticsearch

The case of multi-term queries in Elasticsearch offers some room for discussion, because there are several options to consider depending on the specific use case we’re dealing with.

Multi-term queries are, in their most generic definition, queries with several terms. These terms could be completely unrelated, they could be about the same topic, or they could even be part of a single specific concept. All these scenarios call for different configurations. This article discusses some of the options offered by Elasticsearch.

Sample Documents

Assuming Elasticsearch is up and running on your local machine, you can download the script which creates the data set used in the following examples. These are the documents it creates:

Doc ID | Content
1 | This is a brown fox
2 | This is a brown dog
3 | This dog is really brown
4 | The dog is brown but this document is very very long
5 | There is also a white cat
6 | The quick brown fox jumps over the lazy dog

Notice that, for the sake of these examples, we’re using the default configuration, i.e. the default TF-IDF scoring function from Lucene, which includes some score normalisation based on document length (shorter documents are promoted).

In order to run all the following queries, the basic option is to use curl, e.g.:

curl -XPOST http://localhost:9200/test/articles/_search?pretty=true -d '{THE QUERY CODE HERE}'

although one could embed the query in Python code as discussed in a previous post (or in any programming language that allows you to make REST calls). With the pretty=true parameter, the JSON output will be more readable in the shell.
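As a minimal sketch (assuming the requests library is installed), the same call can be made from Python; any of the query bodies discussed below can be sent in the same way:

import json
import requests

# Replace the body with any of the queries discussed below
query = {"query": {"match": {"content": "brown dog"}}}

response = requests.post('http://localhost:9200/test/articles/_search?pretty=true',
                         data=json.dumps(query))
print(response.text)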

A General Purpose Query

The first example is a query for the simple case of terms which may or may not be related. In this scenario, we decide to use a classic match query. This kind of query does not impose any restriction between multiple terms, but of course it will promote documents which contain more query terms.

{
    "query": {
        "match": {
            "content": {
                "query": "quick brown dog"
            }
        }
     }
}

This query retrieves 5 documents, in this order:

Pos | Doc ID | Content | Score
1 | 6 | The quick brown fox jumps over the lazy dog | 0.81502354
2 | 2 | This is a brown dog | 0.26816052
3 | 3 | This dog is really brown | 0.26816052
4 | 4 | The dog is brown but this document is very very long | 0.15323459
5 | 1 | This is a brown fox | 0.055916067

We can notice how the first document has a much higher score, because it’s the only one containing all the query terms. The documents in positions 2 and 3 share the same score, because they have the same number of matches (two terms) and the same document length. The document in fourth position, despite having the same number of matches as the previous two, has a lower score because it’s much longer, so the document length normalisation penalises it. We can also notice how the last document has a very low score and is probably irrelevant.

Precision on a multi-term query can be controlled by specifying an arbitrary threshold for the number of terms which should be matched. For example, we can re-write the query as:

{
    "query": {
        "match": {
            "content": {
                "query": "quick brown dog",
                "minimum_should_match": 75%
            }
        }
     }
}

The output will be basically the same, with the exception that only the top four documents are returned. This is because the fifth document, “This is a brown fox”, only matches 1/3 of the query terms, which is below 75%. You can experiment with different thresholds for the minimum match, keeping in mind that there is a balance to find between removing irrelevant documents and not losing the relevant ones.

The Case of Phrase Matching

In the previous example, the query terms were completely unrelated, so the query “quick brown dog” also retrieved brown foxes and non-quick dogs. What if we need an exact match of the query? More precisely, what if we need to match all the query terms in their relative position? This is the case for named entities like “New York”, where the two terms individually don’t convey the same meaning as the two of them concatenated in this order.

Elasticsearch has an option for this: match_phrase. The previous query can be rewritten as:

{
    "query": {
        "match_phrase": {
            "content": "quick brown dog"
        }
    }
}

We immediately see that the query returns an empty result set: there is no document about quick brown dogs. Let’s re-write the query in a less restrictive way, dropping the “quick” term:

{
    "query": {
        "match_phrase": {
            "content": "brown dog"
        }
    }
}

Now we can see how the query retrieves only one document, precisely document 2, the only one to match the exact phrase.

Proximity Search

Sometimes a phrase match can be too restrictive. What if we’re not really interested in an exact match, but we’d rather retrieve documents where the query terms occur reasonably close to each other? This is an example of proximity search: the order of the terms doesn’t really matter, as long as they occur within the same context. This concept is less restrictive than a pure phrase match, but still stronger than a general purpose query.

In order to achieve proximity search, we simply need to define the search window, i.e. how far apart we allow the terms to be. This is called slop in Elasticsearch/Lucene terminology. The change to the previous code is really minimal; for example, for a slop/window of 3 terms:

{
    "query": {
        "match_phrase": {
            "content": {
                "query": "brown dog",
                "slop": 3
            }
        }
    }
}

The result of the query:

Pos | Doc ID | Content | Score
1 | 2 | This is a brown dog | 0.9547657
2 | 4 | The dog is brown but this document is very very long | 0.2727902

We immediately see that the second document is also relevant to the query, but it was missed by the original phrase match. We can also try with a bigger slop, e.g.:

{
    "query": {
        "match_phrase": {
            "content": {
                "query": "brown dog",
                "slop": 4
            }
        }
    }
}

which retrieves the following:

Pos | Doc ID | Content | Score
1 | 2 | This is a brown dog | 0.9547657
2 | 3 | This dog is really brown | 0.4269842
3 | 4 | The dog is brown but this document is very very long | 0.2727902

The new result, document 3, is still relevant. On the other hand, if we keep increasing the slop, at some point we will end up including non-relevant, or less relevant, results. It’s hence important to understand the needs of the specific scenario and find a balance between not missing relevant results and not including non-relevant ones.

Within-Sentence Proximity Search

A variation of the proximity search discussed above consists in the need to match terms occurring in a specific context. Such a context could be the same sentence, the same paragraph, the same section, etc. The difference from what we discussed in the previous section is that here we might have a specific structure in mind (sections, sentences, …) but not a specific window/slop size.

Let’s assume the “content” field of our documents is a list of sentences, so we want to perform proximity search within a sentence. An example of a document with two sentences:

{
    "content": ["This is a brown fox", "This is white dog"]
}

The trick to allow within-sentence search is to define a slop which is big enough to capture the sentence length, and to use it as a position offset:

{
    "properties": {
        "content": {
            "type": "string",
            "position_offset_gap": 100
        }
    }
}

Here the value 100 is arbitrary and simply “big enough”. Pushing this configuration to our index, we force the term positions to jump 100 positions ahead whenever a new sentence starts. In the previous document, if the term “fox” is in position 5, the following term “This” will be in position 106 rather than 6, because it’s in a new sentence. You can download the full script to implement sentence-based proximity search, with the documents updated to reflect the sentence structure, keeping in mind that applying this option to an existing data set requires re-indexing.
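As a minimal sketch of how such a mapping could be applied (using the requests library; the index name test_sentences is hypothetical and the index must not already exist):

import json
import requests

index_settings = {
    "mappings": {
        "articles": {
            "properties": {
                "content": {
                    "type": "string",
                    "position_offset_gap": 100
                }
            }
        }
    }
}
# Create the index with the custom mapping
requests.put('http://localhost:9200/test_sentences', data=json.dumps(index_settings))

# Index a document whose content is a list of sentences
doc = {"content": ["This is a brown fox", "This is a white dog"]}
requests.post('http://localhost:9200/test_sentences/articles/1', data=json.dumps(doc))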

The value of the position offset can now be used as slop value:

{
    "query": {
        "match_phrase": {
            "content": {
                "query": "brown dog",
                "slop": 100
            }
        }
    }
}

This query will return four documents. In particular, document 1, which mentions a white dog and a brown fox, will not be retrieved, because the two terms appear in different sentences.

Summary

We have explored some options in Elasticsearch to improve the results of queries with multiple terms. We have seen how Elasticsearch provides this functionality in a fairly easy way. The starting point is to understand the specific use case that we’re trying to tackle, and from there we have a set of choices. Depending on the scenario, we might want to choose between:

  • a simple match search
  • a match search with a minimum match ratio
  • a phrase-based match
  • a phrase match with a slop for proximity search
  • a phrase match with a slop which matches the position offset specified in the index, for sentence-based (or any other context-based) proximity search

PyData London Meetup 2015-02-03

A quick update with my impressions on the last PyData London meet-up I attended this evening.

I missed the past couple of meet-ups because of some clashes with my personal schedule, so this was my first time in the new venue, still close to Old Street, hosted by Lyst. A nice big room to accommodate many, many people (more than 200 probably?), and a nice selection of craft beers.

We started with the usual initial community-related announcements, including the impressive achievement of the group: 1,000+ members on the meetup.com dedicated page! Well done guys!

The core topic of the evening was data visualisation. For this reason, the main candidate for Python Module Of The Month was (… drum roll …) matplotlib :-D

The first talk, “Thinking about data visualisation”, was given by Andy Kirk from Visualising Data. Thinking was an important keyword of the presentation: after an initial comparison of talent (e.g. artistic, creative) vs thinking, Andy went on to discuss different aspects of thinking and how our thinking can add more value to the visualisations we are proposing. Long story short, you don’t have to be an artist to create useful visualisations, as long as you take the context and the overall story into account.

The second talk, “Lies, damned lies and dataviz”, was given by Andrew Clegg from Etsy. Andrew started the presentation with some arguments in support of providing good dataviz, and then continued with a great sequence of examples of bad dataviz, discussing how some of them simply confuse the user without providing any additional insight, while others really trick the user into wrong conclusions. Whether these are damned lies or just bad design choices… well, that’s another story.

Both presentations were really great, and both of them technology-agnostic, which is something I enjoyed for some reason.

During the final lightning-and-community-talks, the interesting news was the announcement of a new O’Reilly book on data visualisation with Python and Javascript (link to come).

Overall a very nice evening, congratulations to all the organisers.

PyData London meetup.com group:
http://www.meetup.com/PyData-London-Meetup/

How to Query Elasticsearch with Python

Elasticsearch is an open-source distributed search server built on top of Apache Lucene. It’s a great tool that allows you to quickly build applications with full-text search capabilities. The core implementation is in Java, but it provides a nice REST interface which allows you to interact with Elasticsearch from any programming language.

This article provides an overview on how to query Elasticsearch from Python. There are two main options:

  • Implement the REST-API calls to Elasticsearch
  • Use one of the Python libraries that does the above for you

Quick Intro on Elasticsearch

Elasticsearch is developed in Java on top of Lucene, but the format for configuring the index and querying the server is JSON. Once the server is running, by default it’s accessible at localhost:9200 and we can start sending our commands via e.g. curl:

curl -XPOST http://localhost:9200/test/articles/1 -d '{
    "content": "The quick brown fox"
}'

This command creates a new document, and since the index didn’t exist, it also creates the index. Specifically, the format of the URL is:

http://hostname:port/index_name/doc_type/doc_id

so we have just created an index “test” which contains documents of type “articles”. The document has only one field, “content”. Since we didn’t specify otherwise, the content is indexed using the default Lucene analyzer (which is usually a good choice for standard English). The document id is optional, and if we don’t explicitly provide one, the server will create a random hash-like one.

We can insert a few more documents, see for example the file create_index.sh from the code snippets on github.

Once the documents are indexed, we can perform a simple search, e.g.:

curl -XPOST http://localhost:9200/test/articles/_search?pretty=true -d '{
    "query": {
        "match": {
            "content": "dog"
        }
    }
}'

Using the sample documents above, this query should return only one document. Performing the same query with the term “fox” rather than “dog” should instead return four documents, ranked according to their relevance.

How the Elasticsearch/Lucene ranking function works, and all the countless configuration options for Elasticsearch, are not the focus of this article, so bear with me if we’re not digging into the details. For the moment, we’ll just focus on how to integrate/query Elasticsearch from our Python application.

Querying Elasticsearch via REST in Python

One of the options for querying Elasticsearch from Python is to create the REST calls for the search API and process the results afterwards. The requests library is particularly easy to use for this purpose. We can install it with:

pip install requests

The sample query used in the previous section can be easily embedded in a function:

import json
import requests

def search(uri, term):
    """Simple Elasticsearch Query"""
    query = json.dumps({
        "query": {
            "match": {
                "content": term
            }
        }
    })
    response = requests.get(uri, data=query)
    results = json.loads(response.text)
    return results

The “results” variable will be a dictionary loaded from the JSON response. We can pretty-print the JSON to observe the full output and understand all the information it provides, but again this is beyond the scope of this post. Instead, we can simply print the results nicely, one document per line, as follows:

def format_results(results):
    """Print results nicely:
    doc_id) content
    """
    for doc in results['hits']['hits']:
        print("%s) %s" % (doc['_id'], doc['_source']['content']))

Similarly, we can create new documents:

def create_doc(uri, doc_data=None):
    """Create a new document."""
    # Use None as default to avoid a shared mutable default argument
    query = json.dumps(doc_data or {})
    response = requests.post(uri, data=query)
    print(response)

with the doc_data variable being a (Python) dictionary which resembles the structure of the document we’re creating.
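A hypothetical usage example, posting to the collection URL so that Elasticsearch assigns the document id (the content here is just for illustration):

create_doc('http://localhost:9200/test/articles/',
           doc_data={'content': 'Another brown fox'})
# <Response [201]>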

You can see a full working toy example in the rest.py file in the Gist on github.

Querying Elasticsearch Using elasticsearch-py

The requests library is fairly easy to use, but there are several libraries that abstract away the details of the REST API and focus on Elasticsearch concepts instead. In particular, the official Python client for Elasticsearch, called elasticsearch-py, can be installed with:

pip install elasticsearch

It’s fairly low-level compared to other client libraries with similar capabilities, but it provides a consistent and easy-to-extend API.

We can replicate the search used with the requests library, as well as the result print-out, just using a few lines of Python:

from elasticsearch import Elasticsearch

es = Elasticsearch()
res = es.search(index="test", doc_type="articles", body={"query": {"match": {"content": "fox"}}})
print("%d documents found" % res['hits']['total'])
for doc in res['hits']['hits']:
    print("%s) %s" % (doc['_id'], doc['_source']['content']))

In a similar fashion, we can re-create the functionality of adding an extra document:

es.create(index="test", doc_type="articles", body={"content": "One more fox"})

The full functionality of this client library is well described in the documentation.

Summary

This article has briefly discussed a couple of options to integrate Elasticsearch into a Python application. The key points of the discussion are:

  • We can interact with Elasticsearch using the REST API
  • The requests library is particularly useful for this purpose, and probably much cleaner and easier to use than the urllib module (part of the standard library)
  • Many other Python libraries implement an Elasticsearch client, abstracting away the concepts related to the REST API and focusing on Elasticsearch concepts
  • We have seen simple examples with elasticsearch-py

The full code for the examples is available as usual in a Gist:
https://gist.github.com/bonzanini/fe2ff32116f16e3009be