Python

Fuzzy String Matching in Python

Fuzzy String Matching, also called Approximate String Matching, is the process of finding strings that approximatively match a given pattern.
The closeness of a match is often measured in terms of edit distance, which is the number of primitive operations necessary to convert the string into an exact match.
Primitive operations are usually: insertion (to insert a new character at a given position), deletion (to delete a particular character) and substitution (to replace a character with a new one).

Fuzzy String Matching can have different practical applications. Typical examples are spell-checking, text re-use detection (the politically correct way of calling plagiarism detection), spam filtering, as well as several applications in the bioinformatics domain, e.g. matching DNA sequences.

This article plays around with fuzzywuzzy, a Python library for Fuzzy String Matching.

Getting Started with Fuzzywuzzy

FuzzyWuzzy has been developed and open-sourced by SeatGeek, a service to find sport and concert tickets. Their original use case, as discussed in their blog, was the problem given by the many different ways of labelling the same event, adding or hiding location, dates, venue, etc. This problem is also arising with different entities like persons or companies.

To install the library, you can use pip as usual:

pip install fuzzywuzzy

The main modules in FuzzyWuzzy are called fuzz, for string-to-string comparisons, and process to compare a string with a list of strings.

Under the hood, FuzzyWuzzy uses difflib, part of the standard library, so there is nothing extra to install. We can anyway benefit from the performance of python-Levenshtein for sequence matching, so let’s also install this library:

pip install python-Levenshtein

Examples of Usage

Firstly, let’s import the main modules:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In order to calculate a similarity score between two strings, we can use the methods ratio() or partial_ratio():

fuzz.ratio("ACME Factory", "ACME Factory Inc.")
# 83
fuzz.partial_ratio("ACME Factory", "ACME Factory Inc.")
# 100

We can see how the ratio() function is confused by the suffix “Inc.” used in company names, but really the two strings refer to the same entity. This is captured by the partial ratio.

More examples:

fuzz.ratio('Barack Obama', 'Barack H. Obama')
# 89
fuzz.partial_ratio('Barack Obama', 'Barack H. Obama')
# 75

fuzz.ratio('Barack H Obama', 'Barack H. Obama')
# 97
fuzz.partial_ratio('Barack H Obama', 'Barack H. Obama')
# 92

Here we observe the opposite behaviour: different variations in Barack Obama’s name produce a lower score for the partial ratio, why is that? Probably because the extra token for the middle name is right in the middle of the string. For this particular case, we can benefit by other functions that tokenise the string and treat it as a set or as a sequence of words:

fuzz.token_sort_ratio('Barack Obama', 'Barack H. Obama')
# 92
fuzz.token_set_ratio('Barack Obama', 'Barack H. Obama')
# 100

fuzz.token_sort_ratio('Barack H Obama', 'Barack H. Obama')
# 100
fuzz.token_set_ratio('Barack H Obama', 'Barack H. Obama')
# 100

The token_* functions split the string on white-spaces, lowercase everything and get rid of non-alpha non-numeric characters, which means punctuation is ignored (as well as weird unicode symbols).

In case we have a list of options and we want to find the closest match(es), we can use the process module:

query = 'Barack Obama'
choices = ['Barack H Obama', 'Barack H. Obama', 'B. Obama']
# Get a list of matches ordered by score, default limit to 5
process.extract(query, choices)
# [('Barack H Obama', 95), ('Barack H. Obama', 95), ('B. Obama', 85)]

# If we want only the top one
process.extractOne(query, choices)
# ('Barack H Obama', 95)

Summary

This article has introduced Fuzzy String Matching, which is a well understood problem with some interesting practical applications.

Python has a very simple option to tackle the problem: the FuzzyWuzzy library, which is built on top of difflib (and python-Levenshtein for speed). It can take a while to figure out how to scope our string matching problem, but the easy interface of fuzzywuzzy should help speeding up the development.

21 thoughts on “Fuzzy String Matching in Python

  1. Interesting article.

    I am trying to use it but the process matching sound really weird…
    My query is : “Eclair Chocolat”
    Choices are “[‘chou’, ‘chouchou’]”
    And i got 67% similarities… What the… ?!

    Otherwise, it is fun to use :)

    Like

  2. Right away, I got some different results following your examples.
    The first one:
    In [6]: fuzz.ratio(“ACME Factory”, “ACME Factory Inc.”)
    Out[6]: 83 # you had 62
    the fourth one:
    In [10]: fuzz.partial_ratio(‘Barack H Obama’, ‘Barack H. Obama’)
    Out[10]: 92 # you had 75

    I’m using IPython 3.1.0 with Python 3.4 on Ubuntu 15.04. I don’t know what you are using but don’t think it should make this much difference, should it? In both cases, the difference is about 20 points, which has to be significant in any use case analysis, don’t you think? What accounts for the difference?

    Like

    1. Thanks Malik,

      regarding the first example (ACME factory), you are correct, very likely to be a cut-and-paste error on my side at the time, so I fixed it.

      regarding the other examples with Barack Obama, please notice the “H” for the middle name, that’s what makes the difference, e.g.
      fuzz.partial_ratio(‘Barack H Obama’, ‘Barack H. Obama’) # 92
      fuzz.partial_ratio(‘Barack Obama’, ‘Barack H. Obama’) # 75

      Like

      1. How is this being used in the real world? Must the user provide the comparisons, or can it be attached to a search mechanism to find similar strings in the first place?

        Like

      2. If you look at the blog post by SeatGeek linked above, you’ll have a nice example of a real use-case. In general, you’ll need to provide the strings for comparison and the algorithms will give you some sort of similarity score between two strings. How you interpret these scores is up to you (e.g. you can decide to look for some similarity threshold).

        Like

  3. Interesting article, thanks. I am looking at a scenario where you have a finite set of phrases but you have to find the closest phonetic match (something like the process.extract function above but the score is by the similarity in their pronunciation).
    For eg. say the possible words are {pause, up, down, back, next, quit} and the word is boss then the best match should be pause.
    I am working on a speech recognition project and the software sometimes interprets the spoken word to something different. Is there any python module that can accomplish this?

    Thanks for your time in advance.
    Sarang

    Like

    1. Hi Sarang, speech recognition is not really my expertise so I’m afraid I can’t tell you much. There are a few libs for phonetic algorithms out there, but I don’t have a specific recommendation.
      Cheers,
      Marco

      Like

  4. I am using fuzzywuzzy to match a file with 500k records to a catalog with 100k records. So far the performance has been very poor. Did you face any performance issues?

    Like

    1. Hi,
      if by performance you mean speed, so far I’ve used it mainly in offline jobs, where speed was not a problem. But in principle, yes, fuzzy string matching can be computationally expensive. You might want to restrict the space of your problem first, e.g. find ways to avoid a brute force with 500k x 100k matches.
      If instead you mean accuracy, I’d say it depends on the application and the expected output.

      Like

  5. Hi Marco,
    Fuzzy wuzzy seems like a great package, and I am hoping to use it to do fuzzy matching of Indian district and village names in larger. I’m trying to use the partial_ratio function. Unfortunately, I’m running into some problems with that. Basically, I am getting matching scores (usually 50, though sometimes different) that are less than 100, even when the second string is contained in the first.

    In particular,
    >>> fuzz.partial_ratio(‘Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance -regarding,’,’Orissa’)
    50
    >>> fuzz.partial_ratio(‘Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance -regarding,’,’Barbil’)
    50

    But when I shave off some of the end of the first long string, I get 100% partial match:
    >>> fuzz.partial_ratio(‘Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clear’,’Orissa’)
    100
    >>>fuzz.partial_ratio(‘Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clear’,’Barbil’)
    100

    Do you have any guidance as to what might be happening? Thank you very much!

    Like

  6. Hi Marco:

    Thanks for this awesome article and walk-through. I’m wondering if this can be used in online or incremental clustering?

    My idea is as follows:
    I’ve google search short string phrases stored in Azure SQL DB. We add new search phrases at the end of the day as a batch process. Now we want to cluster the data so that similar string search phrases will be grouped into clusters and if there are new unique search phrases, then we add one more cluster.

    Are there implementation in this area. Can be R/Python

    Thanks
    Daniel

    Like

    1. Hi Daniel
      it’s possible but potentially I’d expect some performance issues with larger data sets. I’m not aware of existing implementations I’m afraid.

      Cheers
      Marco

      Like

  7. Hi Marco,

    Let say i am using fuzz.ratio(“7 from 7”, “great”) – it results in 15 percentage, how it could 15% is there any better algorithm to match the string.

    Like

    1. “Return a measure of the sequences’ similarity as a float in the range [0, 1].

      Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.”

      fuzzywuzzy uses difflib, looking at the ratio function here https://docs.python.org/2/library/difflib.html we see the above quote describing how ratio() works.

      13 elements by my count, the two r’s are the only matches, so 2/13, which is 15.384, which gets rounded down to 15%.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s