Fuzzy String Matching, also called Approximate String Matching, is the process of finding strings that approximatively match a given pattern.
The closeness of a match is often measured in terms of edit distance, which is the number of primitive operations necessary to convert the string into an exact match.
Primitive operations are usually: insertion (to insert a new character at a given position), deletion (to delete a particular character) and substitution (to replace a character with a new one).
Fuzzy String Matching can have different practical applications. Typical examples are spell-checking, text re-use detection (the politically correct way of calling plagiarism detection), spam filtering, as well as several applications in the bioinformatics domain, e.g. matching DNA sequences.
This article plays around with fuzzywuzzy, a Python library for Fuzzy String Matching.
Getting Started with Fuzzywuzzy
FuzzyWuzzy has been developed and open-sourced by SeatGeek, a service to find sport and concert tickets. Their original use case, as discussed in their blog, was the problem given by the many different ways of labelling the same event, adding or hiding location, dates, venue, etc. This problem is also arising with different entities like persons or companies.
To install the library, you can use pip as usual:
pip install fuzzywuzzy
The main modules in FuzzyWuzzy are called fuzz, for string-to-string comparisons, and process to compare a string with a list of strings.
Under the hood, FuzzyWuzzy uses difflib, part of the standard library, so there is nothing extra to install. We can anyway benefit from the performance of python-Levenshtein for sequence matching, so let’s also install this library:
pip install python-Levenshtein
Examples of Usage
Firstly, let’s import the main modules:
from fuzzywuzzy import fuzz from fuzzywuzzy import process
In order to calculate a similarity score between two strings, we can use the methods ratio() or partial_ratio():
fuzz.ratio("ACME Factory", "ACME Factory Inc.") # 83 fuzz.partial_ratio("ACME Factory", "ACME Factory Inc.") # 100
We can see how the ratio() function is confused by the suffix “Inc.” used in company names, but really the two strings refer to the same entity. This is captured by the partial ratio.
fuzz.ratio('Barack Obama', 'Barack H. Obama') # 89 fuzz.partial_ratio('Barack Obama', 'Barack H. Obama') # 75 fuzz.ratio('Barack H Obama', 'Barack H. Obama') # 97 fuzz.partial_ratio('Barack H Obama', 'Barack H. Obama') # 92
Here we observe the opposite behaviour: different variations in Barack Obama’s name produce a lower score for the partial ratio, why is that? Probably because the extra token for the middle name is right in the middle of the string. For this particular case, we can benefit by other functions that tokenise the string and treat it as a set or as a sequence of words:
fuzz.token_sort_ratio('Barack Obama', 'Barack H. Obama') # 92 fuzz.token_set_ratio('Barack Obama', 'Barack H. Obama') # 100 fuzz.token_sort_ratio('Barack H Obama', 'Barack H. Obama') # 100 fuzz.token_set_ratio('Barack H Obama', 'Barack H. Obama') # 100
The token_* functions split the string on white-spaces, lowercase everything and get rid of non-alpha non-numeric characters, which means punctuation is ignored (as well as weird unicode symbols).
In case we have a list of options and we want to find the closest match(es), we can use the process module:
query = 'Barack Obama' choices = ['Barack H Obama', 'Barack H. Obama', 'B. Obama'] # Get a list of matches ordered by score, default limit to 5 process.extract(query, choices) # [('Barack H Obama', 95), ('Barack H. Obama', 95), ('B. Obama', 85)] # If we want only the top one process.extractOne(query, choices) # ('Barack H Obama', 95)
This article has introduced Fuzzy String Matching, which is a well understood problem with some interesting practical applications.
Python has a very simple option to tackle the problem: the FuzzyWuzzy library, which is built on top of difflib (and python-Levenshtein for speed). It can take a while to figure out how to scope our string matching problem, but the easy interface of fuzzywuzzy should help speeding up the development.