Fuzzy String Matching in Python

Fuzzy String Matching, also called Approximate String Matching, is the process of finding strings that approximatively match a given pattern.
The closeness of a match is often measured in terms of edit distance, which is the number of primitive operations necessary to convert the string into an exact match.
Primitive operations are usually: insertion (to insert a new character at a given position), deletion (to delete a particular character) and substitution (to replace a character with a new one).

Fuzzy String Matching can have different practical applications. Typical examples are spell-checking, text re-use detection (the politically correct way of calling plagiarism detection), spam filtering, as well as several applications in the bioinformatics domain, e.g. matching DNA sequences.

This article plays around with fuzzywuzzy, a Python library for Fuzzy String Matching.

Getting Started with Fuzzywuzzy

FuzzyWuzzy has been developed and open-sourced by SeatGeek, a service to find sport and concert tickets. Their original use case, as discussed in their blog, was the problem given by the many different ways of labelling the same event, adding or hiding location, dates, venue, etc. This problem is also arising with different entities like persons or companies.

To install the library, you can use pip as usual:

pip install fuzzywuzzy

The main modules in FuzzyWuzzy are called fuzz, for string-to-string comparisons, and process to compare a string with a list of strings.

Under the hood, FuzzyWuzzy uses difflib, part of the standard library, so there is nothing extra to install. We can anyway benefit from the performance of python-Levenshtein for sequence matching, so let’s also install this library:

pip install python-Levenshtein

Examples of Usage

Firstly, let’s import the main modules:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In order to calculate a similarity score between two strings, we can use the methods ratio() or partial_ratio():

fuzz.ratio("ACME Factory", "ACME Factory Inc.")
# 83
fuzz.partial_ratio("ACME Factory", "ACME Factory Inc.")
# 100

We can see how the ratio() function is confused by the suffix “Inc.” used in company names, but really the two strings refer to the same entity. This is captured by the partial ratio.

More examples:

fuzz.ratio('Barack Obama', 'Barack H. Obama')
# 89
fuzz.partial_ratio('Barack Obama', 'Barack H. Obama')
# 75

fuzz.ratio('Barack H Obama', 'Barack H. Obama')
# 97
fuzz.partial_ratio('Barack H Obama', 'Barack H. Obama')
# 92

Here we observe the opposite behaviour: different variations in Barack Obama’s name produce a lower score for the partial ratio, why is that? Probably because the extra token for the middle name is right in the middle of the string. For this particular case, we can benefit by other functions that tokenise the string and treat it as a set or as a sequence of words:

fuzz.token_sort_ratio('Barack Obama', 'Barack H. Obama')
# 92
fuzz.token_set_ratio('Barack Obama', 'Barack H. Obama')
# 100

fuzz.token_sort_ratio('Barack H Obama', 'Barack H. Obama')
# 100
fuzz.token_set_ratio('Barack H Obama', 'Barack H. Obama')
# 100

The token_* functions split the string on white-spaces, lowercase everything and get rid of non-alpha non-numeric characters, which means punctuation is ignored (as well as weird unicode symbols).

In case we have a list of options and we want to find the closest match(es), we can use the process module:

query = 'Barack Obama'
choices = ['Barack H Obama', 'Barack H. Obama', 'B. Obama']
# Get a list of matches ordered by score, default limit to 5
process.extract(query, choices)
# [('Barack H Obama', 95), ('Barack H. Obama', 95), ('B. Obama', 85)]

# If we want only the top one
process.extractOne(query, choices)
# ('Barack H Obama', 95)

Summary

This article has introduced Fuzzy String Matching, which is a well understood problem with some interesting practical applications.

Python has a very simple option to tackle the problem: the FuzzyWuzzy library, which is built on top of difflib (and python-Levenshtein for speed). It can take a while to figure out how to scope our string matching problem, but the easy interface of fuzzywuzzy should help speeding up the development.

Published by

Marco

Data Scientist View all posts by Marco

29 thoughts on “Fuzzy String Matching in Python”

Kite says:

June 8, 2015 at 12:14 pm

Interesting article.

I am trying to use it but the process matching sound really weird…
My query is : “Eclair Chocolat”
Choices are “[‘chou’, ‘chouchou’]”
And i got 67% similarities… What the… ?!

Otherwise, it is fun to use :)

LikeLiked by 2 people

Reply
1. Marco says:
  
  June 8, 2015 at 5:51 pm
  
  Thanks Kite – you can see the logic behind that number in fuzz.WRatio, i.e. https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/fuzz.py#L236
  
  LikeLiked by 1 person
  
  Reply
Pingback: Fuzzy Matching for Oil/Gas MDM using Python | Adventures in Python, Petroleum Data & GIS
Pingback: Partial String Matching in R and Python Part II | If so, how?
Malik says:

August 16, 2015 at 5:54 pm

Right away, I got some different results following your examples.
The first one:
In [6]: fuzz.ratio(“ACME Factory”, “ACME Factory Inc.”)
Out[6]: 83 # you had 62
the fourth one:
In [10]: fuzz.partial_ratio(‘Barack H Obama’, ‘Barack H. Obama’)
Out[10]: 92 # you had 75

I’m using IPython 3.1.0 with Python 3.4 on Ubuntu 15.04. I don’t know what you are using but don’t think it should make this much difference, should it? In both cases, the difference is about 20 points, which has to be significant in any use case analysis, don’t you think? What accounts for the difference?

LikeLike

Reply
1. Marco says:
  
  August 17, 2015 at 11:53 am
  
  Thanks Malik,
  
  regarding the first example (ACME factory), you are correct, very likely to be a cut-and-paste error on my side at the time, so I fixed it.
  
  regarding the other examples with Barack Obama, please notice the “H” for the middle name, that’s what makes the difference, e.g.
  fuzz.partial_ratio(‘Barack H Obama’, ‘Barack H. Obama’) # 92
  fuzz.partial_ratio(‘Barack Obama’, ‘Barack H. Obama’) # 75
  
  LikeLike
  
  Reply
  1. Malik says:
    
    August 19, 2015 at 12:34 am
    
    How is this being used in the real world? Must the user provide the comparisons, or can it be attached to a search mechanism to find similar strings in the first place?
    
    LikeLike
  2. Marco says:
    
    August 20, 2015 at 8:43 am
    
    If you look at the blog post by SeatGeek linked above, you’ll have a nice example of a real use-case. In general, you’ll need to provide the strings for comparison and the algorithms will give you some sort of similarity score between two strings. How you interpret these scores is up to you (e.g. you can decide to look for some similarity threshold).
    
    LikeLiked by 1 person
jozvison says:

February 14, 2016 at 5:55 am

Reblogged this on jozvison.

LikeLike

Reply
Sarang says:

July 24, 2016 at 3:45 pm

Interesting article, thanks. I am looking at a scenario where you have a finite set of phrases but you have to find the closest phonetic match (something like the process.extract function above but the score is by the similarity in their pronunciation).
For eg. say the possible words are {pause, up, down, back, next, quit} and the word is boss then the best match should be pause.
I am working on a speech recognition project and the software sometimes interprets the spoken word to something different. Is there any python module that can accomplish this?

Thanks for your time in advance.
Sarang

LikeLike

Reply
1. Marco says:
  
  July 25, 2016 at 5:25 am
  
  Hi Sarang, speech recognition is not really my expertise so I’m afraid I can’t tell you much. There are a few libs for phonetic algorithms out there, but I don’t have a specific recommendation.
  Cheers,
  Marco
  
  LikeLike
  
  Reply
2. Gaurav Kotak says:
  
  June 28, 2017 at 12:30 am
  
  Sarang, did you by chance find any library that works here. I have a similar requirement. Thanks!
  
  LikeLiked by 1 person
  
  Reply
  1. Denis Passos says:
    
    June 29, 2017 at 5:11 am
    
    Hi, Gaurav. Today I was reading about a library called Jellyfish. Take a look: http://jellyfish.readthedocs.io/en/latest/
    
    LikeLike
expeditiontomyself says:

August 3, 2016 at 4:06 am

I am using fuzzywuzzy to match a file with 500k records to a catalog with 100k records. So far the performance has been very poor. Did you face any performance issues?

LikeLike

Reply
1. Marco says:
  
  August 4, 2016 at 5:42 am
  
  Hi,
  if by performance you mean speed, so far I’ve used it mainly in offline jobs, where speed was not a problem. But in principle, yes, fuzzy string matching can be computationally expensive. You might want to restrict the space of your problem first, e.g. find ways to avoid a brute force with 500k x 100k matches.
  If instead you mean accuracy, I’d say it depends on the application and the expected output.
  
  LikeLike
  
  Reply
  1. Luke says:
    
    June 10, 2020 at 10:19 am
    
    One Alternative I am using is https://pypi.org/project/rapidfuzz/, which implements the same algorithms, but is quite a bit faster than FuzzyWuzzy. Especially when using string preprocessing this should be done manually, so it is not performed multiple times for the same string.
    However having a 500k*100k comparisions will always take quite a while ;)
    
    LikeLike
Lucy Page says:

September 27, 2016 at 4:25 pm

Hi Marco,
Fuzzy wuzzy seems like a great package, and I am hoping to use it to do fuzzy matching of Indian district and village names in larger. I’m trying to use the partial_ratio function. Unfortunately, I’m running into some problems with that. Basically, I am getting matching scores (usually 50, though sometimes different) that are less than 100, even when the second string is contained in the first.

In particular,
>>> fuzz.partial_ratio(‘Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance -regarding,’,’Orissa’)
50
>>> fuzz.partial_ratio(‘Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clearance -regarding,’,’Barbil’)
50

But when I shave off some of the end of the first long string, I get 100% partial match:
>>> fuzz.partial_ratio(‘Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clear’,’Orissa’)
100
>>>fuzz.partial_ratio(‘Subject: Dalki Manganese Ore Mine of M/S Bharat Process and Mechanical Engineers Ltd., Villages Dalki, Soyabahal, Sading and Thakurani R.F., Tehsil Barbil, Distt, Keonjhar, Orissa environmental clear’,’Barbil’)
100

Do you have any guidance as to what might be happening? Thank you very much!

LikeLike

Reply
Daniel Yehdego says:

February 10, 2017 at 8:33 pm

Hi Marco:

Thanks for this awesome article and walk-through. I’m wondering if this can be used in online or incremental clustering?

My idea is as follows:
I’ve google search short string phrases stored in Azure SQL DB. We add new search phrases at the end of the day as a batch process. Now we want to cluster the data so that similar string search phrases will be grouped into clusters and if there are new unique search phrases, then we add one more cluster.

Are there implementation in this area. Can be R/Python

Thanks
Daniel

LikeLike

Reply
1. Marco says:
  
  February 28, 2017 at 4:51 pm
  
  Hi Daniel
  it’s possible but potentially I’d expect some performance issues with larger data sets. I’m not aware of existing implementations I’m afraid.
  
  Cheers
  Marco
  
  LikeLike
  
  Reply
2. Alsa says:
  
  October 9, 2017 at 11:53 am
  
  Hey Daniel,
  may I know , if you could implement your idea. I am interested in it and I would like to use it in my project referring malware detection.
  I am looking forward to hearing from you.
  Alsa
  
  LikeLike
  
  Reply
Pingback: partial string match | Echo's World
karthik says:

July 1, 2017 at 7:28 am

Hi Marco,

Let say i am using fuzz.ratio(“7 from 7”, “great”) – it results in 15 percentage, how it could 15% is there any better algorithm to match the string.

LikeLike

Reply
1. ChrisC says:
  
  August 8, 2017 at 12:04 am
  
  “Return a measure of the sequences’ similarity as a float in the range [0, 1].
  
  Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.”
  
  fuzzywuzzy uses difflib, looking at the ratio function here https://docs.python.org/2/library/difflib.html we see the above quote describing how ratio() works.
  
  13 elements by my count, the two r’s are the only matches, so 2/13, which is 15.384, which gets rounded down to 15%.
  
  LikeLiked by 1 person
  
  Reply
amirmohammad aliabadi says:

February 27, 2018 at 11:28 am

hi marco
thanks about this
I was able to implement fuzzywuzzy for Persian language and compare a large number of title of news and remove titles that had a high similarity. It was very useful to me

LikeLike

Reply
Pingback: Fuzzy String Matching in Python – Catherine Gitau
rubendramirez says:

June 5, 2018 at 9:23 am

You’re awesome! It works like a charm.

LikeLike

Reply
Kali Linux says:

June 22, 2018 at 7:43 pm

Hi;
Thanks good article

LikeLike

Reply
Gabriel Sarrazin says:

September 13, 2018 at 9:50 am

Thank you very much. Really clear tutorial.

LikeLike

Reply
Kuldeepak Sharma says:

February 13, 2019 at 4:41 pm

Thanks for the clear tutorial!

LikeLike

Reply