Fuzzy / Partial Text search in MongoDB using n-grams
Level: intermediate
Pre-requisites: MongoDB CRUD operations, the `$text` operator in MongoDB, indexing queries
Search Techniques
Consider the following `the sneaky fox` JSON document:
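Something along these lines, assuming a store-like document whose fields contain the words we will be searching for (the exact fields and wording here are illustrative):

```json
{
  "name": "the sneaky fox",
  "description": "a cosy house of handmade pasta"
}
```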
To really put this problem statement into perspective, let’s look at the two types of search techniques:
- Full Text Search - Will return results that contain full and exact matches of the search query. For example, the `sneaky fox` document would be returned in the results only if we search for `sneaky` or `house` or `pasta`, but not for `pas`, `hou`, `ouse`, `sne` and so on, so forth.
- Fuzzy / Partial Text Search - Will return results that contain partial matches of the search query. Now, the `sneaky fox` document would be returned in the results for all the search queries stated in the previous definition, including the partial ones.
MongoDB’s default text search functionality matches entire words only, i.e. it is a full text search. For those who really need partial text search functionality, there are various ways to achieve it:
- Use a solution like Elasticsearch for all data-reading operations. Elasticsearch is a search server that caches a copy of the data for faster retrieval and search
- Use regular expressions to search, which are memory-expensive and on larger datasets would do more harm than good
- Or do a custom implementation of partial text search
Search Example
Let’s first walk through the example provided in the official MongoDB docs:
- Run the following in your Mongo shell to insert the documents into the database
- To use text search, we need to create a text index
- Now, let’s try some search queries (all three steps are sketched together below)
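Loosely following the `stores` example from the official docs, the three steps look roughly like this (the exact sample documents may differ slightly):

```js
// 1. Insert a handful of store documents to search over
db.stores.insertMany([
  { _id: 1, name: "Java Hut", description: "Coffee and cakes" },
  { _id: 2, name: "Burger Buns", description: "Gourmet hamburgers" },
  { _id: 3, name: "Coffee Shop", description: "Just coffee" },
  { _id: 4, name: "Clothes Clothes Clothes", description: "Discount clothing" },
  { _id: 5, name: "Java Shopping", description: "Indian food and tea" }
])

// 2. Create a text index on the fields we want to search
db.stores.createIndex({ name: "text", description: "text" })

// 3. Searching for full words matches as expected...
db.stores.find({ $text: { $search: "java coffee shop" } })

// ...but partial words return no results
db.stores.find({ $text: { $search: "cof" } })
db.stores.find({ $text: { $search: "ava" } })
```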
As we follow through the examples, it becomes clear that partial text search does not work out of the box in MongoDB.
n-grams
Enter `n-grams`, whose Wikipedia definition states:
an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application.
Furthermore,
Using Latin numerical prefixes, an n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. English cardinal numbers are sometimes used, e.g., “four-gram”, “five-gram”, and so on. In computational biology, a polymer or oligomer of a known size is called a k-mer instead of an n-gram, with specific names using Greek numerical prefixes such as “monomer”, “dimer”, “trimer”, “tetramer”, “pentamer”, etc., or English cardinal numbers, “one-mer”, “two-mer”, “three-mer”, etc.
For example,
- 1-grams (unigrams) of `coffee` would be `["c", "o", "f", "f", "e", "e"]`.
- 2-grams (bigrams) of `coffee` would be `["co", "of", "ff", "fe", "ee"]`.
- 3-grams (trigrams) of `coffee` would be `["cof", "off", "ffe", "fee"]`.
You can play with the concept in the short sketch below, or in any Node.js REPL.
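A minimal helper to try it out (`grams` is just an illustrative name):

```js
// Return all contiguous substrings ("grams") of length n from a word
function grams(word, n) {
  const out = [];
  for (let i = 0; i + n <= word.length; i++) {
    out.push(word.slice(i, i + n));
  }
  return out;
}

console.log(grams("coffee", 1)); // [ 'c', 'o', 'f', 'f', 'e', 'e' ]
console.log(grams("coffee", 2)); // [ 'co', 'of', 'ff', 'fe', 'ee' ]
console.log(grams("coffee", 3)); // [ 'cof', 'off', 'ffe', 'fee' ]
```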
I hope you are with me so far. If not, scroll back up and go through it again. We are going to jump into some code now.
Building our N-gram generator
So, now we are going to break the text itself into n-grams so that the search algorithm can match against them. Let’s write a function that:
- takes an array of objects as an argument
- takes an array of keys whose values will form the search space, e.g. `['name', 'description']` in our stores database
- takes a number as an argument for the minimum number of searchable characters (we will set this to 2 by default)
- returns the array with an additional `searchText` field containing the n-grams
Pause for a moment and think about why the third argument is necessary here.
Play around with the function sketched below.
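Here is a minimal sketch of such a generator; the names `addSearchText` and `nGrams`, and the choice to generate every substring of length `minChars` or longer, are assumptions made for illustration:

```js
// Generate all contiguous substrings of `word` with length >= minChars
function nGrams(word, minChars) {
  const grams = [];
  for (let size = minChars; size <= word.length; size++) {
    for (let start = 0; start + size <= word.length; start++) {
      grams.push(word.substring(start, start + size));
    }
  }
  return grams;
}

// arr      - array of documents
// keys     - keys whose values form the search space, e.g. ['name', 'description']
// minChars - minimum number of searchable characters (defaults to 2)
function addSearchText(arr, keys, minChars = 2) {
  return arr.map((obj) => {
    // Collect all the unique words of the search space in a Set
    const words = new Set();
    keys.forEach((key) => {
      String(obj[key] || "")
        .toLowerCase()
        .split(/\s+/)
        .filter(Boolean)
        .forEach((word) => words.add(word));
    });

    // Generate n-grams for every word; a second Set (S2) avoids duplicates
    const allGrams = new Set();
    words.forEach((word) => {
      nGrams(word, minChars).forEach((gram) => allGrams.add(gram));
    });

    // Concatenate everything into a single string to be indexed
    return { ...obj, searchText: [...allGrams].join(" ") };
  });
}
```

For instance, calling `addSearchText([{ name: "Coffee Shop", description: "Just coffee" }], ["name", "description"])` returns the same object with a `searchText` value containing grams such as `co`, `cof`, `coffee`, `sh`, `sho`, `shop`, `ju`, `just`, and so on.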
Code Snippet Explanation
- We start looping through the array of objects `arr`
- Then we loop through the array of keys which are to be included in the `searchText`
- We add all the unique words of the search space to a Set
- Next, we loop over the contents of that Set to generate n-grams for each word and add them to another Set (say S2), so as to avoid duplicates and save space
- We now extract all the words from S2 into an array and concatenate them into a string
- End loop from Step 2.
- End loop from Step 1.
Testing our search implementation
Now let us update the documents in our database using the following queries so that they contain the `searchText` field:
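One way to do this, assuming the `addSearchText()` generator from the previous section is available in your shell session (or in a small Node.js script using the driver):

```js
// Read all documents, compute their searchText, and write it back
const docs = db.stores.find().toArray();

addSearchText(docs, ["name", "description"], 2).forEach((doc) => {
  db.stores.updateOne({ _id: doc._id }, { $set: { searchText: doc.searchText } });
});
```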
Before we go and try to search through the documents, it is important to drop the previous index that was created and create a new one on the `searchText` field.
- Get all index information by executing `db.stores.getIndexes()`, look for the index which has weights for `name` and `description` (the one we created initially) and copy its name; in my case it was `name_text_description_text`.
- Drop that index and create a new index on the `searchText` field:
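With the index name copied above, that is roughly:

```js
// Remove the old text index on name + description
db.stores.dropIndex("name_text_description_text")

// Index the generated n-grams instead
db.stores.createIndex({ searchText: "text" })
```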
Let us try the partial text search queries we attempted earlier; also add the sweet projection `{ searchText: 0 }` for a tidier output:
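For example:

```js
// Partial queries that previously returned nothing now match
db.stores.find({ $text: { $search: "cof" } }, { searchText: 0 })
db.stores.find({ $text: { $search: "ava" } }, { searchText: 0 })
```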
That’s it! We have a working partial text search implementation.
Epilogue
Following through, you must have realised that:
- this will be heavy on the storage requirements
- the n-gram generation and processing logic can be implemented as a middleware (pre-save and pre-update hooks) in your ORM / driver code
- it will introduce additional latency in write operations if implemented as a middleware
- it will still be better than setting up another search server, though
Another gotcha is the order of the search results, which can be improved by playing with the n-gram generation logic in the code snippet we constructed earlier.