Improving profile search accuracy using the Elasticsearch n-gram tokenizer
What is Elasticsearch?
Elasticsearch is a distributed document store that stores data in an inverted index. An inverted index lists every unique word that appears in any document and identifies all of the documents each word occurs in. Once Elasticsearch indexes a document, the document becomes fully searchable in near real-time (typically within 1 second).
What is the process of storing data in Elasticsearch?
Every document is analyzed when it is added to or updated in Elasticsearch.
What is an analyzer and how does an analyzer work?
- When indexing a document, its full-text fields are run through an analysis process
- The analysis process includes tokenizing and normalizing a block of text.
- We, as users, have full control over the analysis process; if we don’t customize it, Elasticsearch falls back to the standard analyzer.
An analyzer consists of three things:
- Character filter: It cleans up the raw text before tokenization, for example converting HTML styling (bold, italic, etc.) to normal text, and then passes the text to the tokenizer.
- Tokenizer: The tokenizer creates tokens from the text. There are different kinds of tokenizers. The ‘standard’ tokenizer splits the text by whitespace and also removes symbols like $, %, @, #, etc., which do not carry much importance in search. If we want to keep these symbols as they are, we can use the ‘whitespace’ tokenizer, which just splits the text by whitespace and nothing else.
- Token filter: Once the tokenizer creates tokens from the text, these tokens are passed to the token filters. There are different kinds of token filters we can use according to our use case. For example, the ‘lowercase’ token filter converts uppercase letters to lowercase, and the ‘stop’ token filter removes common words like ‘is’, ‘the’, and ‘an’, which do not help improve search results. Similarly, the ‘synonym’ token filter helps in matching similar words. We can apply multiple token filters at once.
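To make the three stages concrete, here is a minimal toy sketch of the pipeline in Python. This is not Elasticsearch’s actual implementation — the regexes and stop-word list are simplified stand-ins for the real character filter, ‘standard’ tokenizer, and ‘lowercase’/‘stop’ filters.

```python
import re

# Simplified stand-in for the 'stop' token filter's word list.
STOP_WORDS = {"is", "the", "an", "a", "for"}

def char_filter(text):
    # Strip HTML tags (bold, italic, etc.), like an HTML-stripping character filter.
    return re.sub(r"<[^>]+>", "", text)

def standard_like_tokenizer(text):
    # Split by whitespace and drop symbols like $, %, @, #, ! —
    # a rough approximation of the 'standard' tokenizer.
    tokens = []
    for raw in text.split():
        tok = re.sub(r"[^\w]", "", raw)
        if tok:
            tokens.append(tok)
    return tokens

def token_filters(tokens):
    # 'lowercase' filter followed by the 'stop' filter.
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

def analyze(text):
    # Full pipeline: character filter -> tokenizer -> token filters.
    return token_filters(standard_like_tokenizer(char_filter(text)))

print(analyze("<b>Anmol</b> is searching for Bhatia!"))
# -> ['anmol', 'searching', 'bhatia']
```

Note how the HTML tags, the ‘!’, and the stop words ‘is’ and ‘for’ are all gone from the final tokens.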
The flow of an analyzer can be seen in the diagram below:
Example of the ‘standard’ tokenizer: it has split the text by whitespace and removed the ‘!’ from ‘bhatia’.
Example of the ‘whitespace’ tokenizer: it just splits the text and keeps the symbols as they are (‘!’ in this case).
Example of token filters. Here we are using two token filters at the same time: ‘lowercase’ has converted uppercase letters to lowercase, as we can see in the first token ‘anmol’, and ‘stop’ has filtered out common words like ‘is’ and ‘for’ from the list of tokens.
What is Index in Elasticsearch?
An index is like a ‘database’ in a relational database system. It has a mapping that defines multiple types.
Analogy => MySQL => Databases => Tables => Columns/Rows
Elasticsearch => Indices => Types => Documents with Properties
When we create an index, we can define how many shards and replicas it will have. By default (in Elasticsearch versions before 7.0), each index is allocated 5 primary shards and 1 replica, which means that if you have at least two nodes in your cluster, your index will have 5 primary shards and another 5 replica shards (1 complete replica), for a total of 10 shards per index.
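As a sketch, the request body for creating an index with explicit shard and replica counts looks like this (the index name ‘user’ is illustrative); the shard arithmetic below shows where the total of 10 comes from:

```python
# Body for: PUT /user  (index name is illustrative)
index_settings = {
    "settings": {
        "number_of_shards": 5,    # primary shards
        "number_of_replicas": 1,  # one complete copy of every primary
    }
}

# Total shards = primaries * (1 + replicas)
primaries = index_settings["settings"]["number_of_shards"]
replicas = index_settings["settings"]["number_of_replicas"]
total_shards = primaries * (1 + replicas)
print(total_shards)  # -> 10
```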
What is an inverted index?
- To understand the inverted index, we first need to know the forward index (or just index). A forward index maps a document to all the words it contains. An inverted index does the opposite: it maps a word to all the documents that contain that particular word.
- The purpose of an inverted index is to store text in a structure that allows for very efficient and fast full-text searches.
- When performing full-text searches, we are actually querying an inverted index and not the JSON documents that we defined when indexing the documents.
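The forward/inverted relationship can be sketched in a few lines of Python (the two sample documents are made up for illustration):

```python
from collections import defaultdict

# Two toy documents, keyed by document id.
docs = {
    1: "anmol likes search",
    2: "bhatia likes elasticsearch",
}

# Forward index: document -> the words it contains.
forward = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: word -> the documents containing that word.
inverted = defaultdict(set)
for doc_id, words in forward.items():
    for word in words:
        inverted[word].add(doc_id)

print(sorted(inverted["likes"]))  # -> [1, 2]  (both documents contain 'likes')
print(sorted(inverted["anmol"]))  # -> [1]
```

A full-text query only has to look up its terms in `inverted` — it never scans the documents themselves, which is what makes the search fast.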
Fuzzy matching in ES:
- Fuzzy matching returns documents that contain terms similar to the search term, as measured by Levenshtein edit distance. The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other. Examples of such edits:
Changing a character (box → fox)
Removing a character (black → lack)
Inserting a character (sic → sick)
Transposing two adjacent characters (act → cat)
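The distance itself is a textbook dynamic-programming computation; here is a compact sketch. (Note: treating a transposition like ‘act’ → ‘cat’ as a single edit, as Elasticsearch’s fuzzy matching does, requires the Damerau-Levenshtein variant — plain Levenshtein counts it as two substitutions.)

```python
def levenshtein(a, b):
    # Classic row-by-row dynamic programming over the edit-distance table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("box", "fox"))    # -> 1 (one substitution)
print(levenshtein("black", "lack")) # -> 1 (one deletion)
print(levenshtein("sic", "sick"))   # -> 1 (one insertion)
```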
What was the problem we faced?
- One of our onboarded creators, ‘Anmol Bhatia’, was not showing up in the search results for the search strings ‘anmol’, ‘bhatia’, and ‘anmol bhaita’ (even with the boost we had given to verified profiles).
Why was this happening?
- The username ‘anmolbhatia’ and the handle ‘anmolbhatia_’ contain no spaces, and we were using the ‘standard’ tokenizer, which, as we know, splits text by whitespace. As a result, Elasticsearch never created separate tokens for her first name or last name. When someone searched for ‘anmol’, her handle was not mapped to that token, so she did not show up in the search results.
- We came up with a solution: the ‘n-gram’ tokenizer.
What is an n-gram tokenizer?
- The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits n-grams of each word of the specified length.
- With the default settings, the ngram tokenizer treats the initial text as a single token and produces N-grams with minimum length 1 and maximum length 2.
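The default behavior (minimum length 1, maximum length 2) can be sketched in Python — this is a toy re-implementation for illustration, not Elasticsearch’s internals, though the output matches the documented example for the input “Quick”:

```python
def ngrams(text, min_gram=1, max_gram=2):
    # Slide over the text, emitting every substring of length
    # min_gram..max_gram starting at each position.
    out = []
    for i in range(len(text)):
        for n in range(min_gram, max_gram + 1):
            if i + n <= len(text):
                out.append(text[i:i + n])
    return out

print(ngrams("Quick"))
# -> ['Q', 'Qu', 'u', 'ui', 'i', 'ic', 'c', 'ck', 'k']
```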
How did n-gram solve our problem?
- With n-gram, we created tokens of length 5.
- For ‘anmolbhatia’, we got these tokens:
Now ‘anmol’ is also mapped to this creator’s profile, so when someone searches for ‘anmol’, the profile shows up in the search results.
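A quick sketch makes the fix easy to verify: with a fixed gram length of 5, the very first token emitted for ‘anmolbhatia’ is ‘anmol’.

```python
def ngrams(text, n=5):
    # All substrings of exactly length n, in position order.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

tokens = ngrams("anmolbhatia")
print(tokens)
# -> ['anmol', 'nmolb', 'molbh', 'olbha', 'lbhat', 'bhati', 'hatia']
print("anmol" in tokens)  # -> True
```

One caveat visible here: ‘bhatia’ is six characters, so it is not itself a length-5 token — a search for ‘bhatia’ matches through its length-5 grams (‘bhati’, ‘hatia’) instead.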
How did we implement n-gram?
- We created a new index named ‘user-ngram’ in the same profile Elasticsearch cluster and enabled dual writes to both indices.
- Then the data was migrated to the new index.
- A new query was written as per n-gram.
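For reference, settings for an index like ‘user-ngram’ might look roughly like this. This is a hedged sketch: the field name ‘handle’, the analyzer names, and the exact parameters are illustrative, not our production configuration.

```python
# Body for: PUT /user-ngram  (names and parameters are illustrative)
ngram_index = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "handle_ngram": {
                    "type": "ngram",
                    "min_gram": 5,   # emit tokens of exactly length 5
                    "max_gram": 5,
                }
            },
            "analyzer": {
                "handle_analyzer": {
                    "type": "custom",
                    "tokenizer": "handle_ngram",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Analyze the handle field with the n-gram analyzer above.
            "handle": {"type": "text", "analyzer": "handle_analyzer"}
        }
    },
}
```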
Improvements in profiles opened from search
We saw a relative growth of 5% in search suggestions clicked for profiles.
Search results are now more relevant to the search query than with the previous standard tokenizer, so users find more relevant search suggestions and engage more with search.