Autocomplete and the Three Algorithms

Drew Miller

A lot of our projects have some form of autocomplete functionality baked into them; but recently we had a client ask for something a little more mistake proof. A lot of users in this particular application will be searching for something based on the sound of a word and not really know the exact spelling.

In talking over examples of how the new autocomplete should match, the product owner sent us in the direction of a possible solution he'd had some experience with in the past: the Soundex algorithm.

This one is too cold

To give a little more context, this particular application is hosted on Heroku and uses a PostgreSQL database for persistent storage. Given that, we started by seeing what PostgreSQL can do for us. As it turns out, it's got support for Soundex right out of the box.

The Soundex system is a method of matching similar-sounding names by converting them to the same code. It was initially used by the United States Census in 1880, 1900, and 1910. Note that Soundex is not very useful for non-English names.

That might be exactly what another application needed, but we aren't dealing with people's names, we're dealing with brand names. Time to break out our SQL-fu and see how the results hold up.

First things first, you'll need to have the fuzzystrmatch module installed on PostgreSQL database. You can do that by dropping into the PSQL console and running: CREATE EXTENSION fuzzystrmatch;. Now in a query you can run something a query to get some results:

Looking those over for our data-set, we aren't getting at all the results we'd want. Time to start looking at other options…

This one is too hot

A little farther down that page of PostgreSQL documentation is the Levenshtein distance algorithm.

This function calculates the Levenshtein distance between two strings.

That sounds promising, let's give that a try. The Levenshtein function comes packaged in the fuzzystrmatch module with Soundex, so no need to go back into the PSQL console, instead we'll start with a query:

It wasn't immediately clear how Levenstein was sorting the results. But further investigations revealed that output of the function is the edit distance between the two strings; which is to say: how many changes you need to make to the get from one string to the other.

Once again, the results from this aren't really what we're looking for, so it's back to the drawing board…

This one is just right

After a broader search, we came upon the pg_trgm module that adds Trigram similarity to PostgreSQL.

A trigram is a group of three consecutive characters taken from a string. We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages.

That sounds much more promising, let's get to the query and see what results we get. Similar to the other two algorithms, we'll have to add the module to PostgreSQL first by running CREATE EXTENSION pg_trgm; in the PSQL console. From there, we can hop back to the query interface:

We've got results that match up to our use case. A search for "hazienda" matches first to the brand named "Hacienda", and then to some other sensible results. Perfect!

Now we just need to step back into our Rails API server and get this query running. That should be easy, right?

The bears came home, and they were upset

Perhaps incorrectly, our first instinct was to try and avoid reinventing the wheel. So we did some searching for reputable gems that wrap the pg_trgm module's Trigram search functionality and came across textacular and pg_search.

If you couldn't tell from the section header, some things went wrong here, but before we get into that let me say that both of these gems are awesome. They've got test coverage, quality metrics, semantic versioning, multiple contributors, etc. In short, they've got all the indicators of healthy, vibrant repositories.

That said, let's talk about what went wrong. Of the two gems, textacular is much more straightforward (though not as feature rich) and has all the functionality we want. The fuzzy_search finder it adds is perfect, we should just be able to call from our Brand model like Brand.fuzzy_search(name: 'hazienda').limit(10) and we should get our Trigram similarity matches.

The catch is, textacular adds in a WHERE clause that was incompatible with our use case. The clause they put in makes sure the fuzzy search term is a substring in the column before running the similarity match. This is probably a performance optimization for large tables and complex combination queries, but for our purposes the raw query was fast enough and the substring match killed our ability to catch spelling mistakes.

Goldilocks made her own porridge

In the end, we decided to roll our own. Following one of the wonderful model refactoring patterns from Code Climate (#4), we extracted it as a query object: