Let’s say
you want to build a search engine for your “e-commerce” site. You provide a
search box, and ask users search for the products in the box. One problem here
is, you can’t expect users to type search string correctly (I mean without
spelling mistakes).
User can
type “wasing machine” instead of “washing machine”, “camra” instead of
“camera”. So it is your responsibility to deal with these kinds of typos. In
this post and subsequent posts, I will explain how to deal with these typos.
Levenshtein distance
Levenshtein
distance between two words is the minimum number of single-character edits
(i.e. insertions, deletions or substitutions) required to change one word into
the other.
For Example,
To change
word “rat” to “cat” we need to substitute ‘r’ by ‘c’, so Levenshtein distance
is 1.
Let’s change
the string “kitten” to “sitting”
1.
“kitten”
-------à “sitten” (Replace ‘k’ by ‘s’)
2.
“sitten”
-------à “sittin” (Replace ‘e’ by ‘i')
3.
“sittin”
---------> “sitting” (Insertion of ‘g’ at end)
Levenshtein
distance to change word “kitten” to “sitting” is 3.
Fuzzy Query : support for misspellings
Elastic
search solves misspellings problem by using fuzzy query.
PUT /blog/posts/_bulk { "index": { "_id": 1 }} { "text": "I love this camera"} { "index": { "_id": 2 }} { "text": "Cameron International"} { "index": { "_id": 3 }} { "text": "canara bank"}
GET /blog/posts/_search { "query" : { "fuzzy" : { "text" : "camara" } } }
You will get
following results.
{ "took": 38, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 2, "max_score": 0.19178301, "hits": [ { "_index": "blog", "_type": "posts", "_id": "3", "_score": 0.19178301, "_source": { "text": "canara bank" } }, { "_index": "blog", "_type": "posts", "_id": "1", "_score": 0.15342641, "_source": { "text": "I love this camera" } } ] } }
As you
observe search query, I searched for “camara”. “camara” can be changed to
“canara” in single edit and “camara” can be changed to “camera” in single edit.
So documents 1 and 3 written in response.
No comments:
Post a Comment