Let’s say
you want to build a search engine for your “e-commerce” site. You provide a
search box, and ask users to search for the products in the search box. One problem here
is, you can’t expect users to type search string correctly (I mean without
spelling mistakes).
User can
type “wasing machine” instead of “washing machine”, “camra” instead of
“camera”. So it is your responsibility to deal with these kinds of typos. In
this post and subsequent posts, I will explain how to deal with these typos.
Levenshtein distance
Levenshtein
distance between two words is the minimum number of single-character edits
(i.e. insertions, deletions or substitutions) required to change one word into
the other.
For Example,
To change
word “rat” to “cat” we need to substitute ‘r’ by ‘c’, so Levenshtein distance
is 1.
Let’s change
the string “kitten” to “sitting”
1.
“kitten”
-------à “sitten” (Replace ‘k’ by ‘s’)
2.
“sitten”
-------à “sittin” (Replace ‘e’ by ‘i')
3.
“sittin”
---------> “sitting” (Insertion of ‘g’ at end)
Levenshtein
distance to change word “kitten” to “sitting” is 3.
Fuzzy Query : support for misspellings
Elastic
search solves misspellings problem by using fuzzy query. The fuzzy query uses
similarity based on Levenshtein edit distance for string fields, and a +/-
margin on numeric and date fields.
PUT /blog/posts/_bulk { "index": { "_id": 1 }} { "text": "I love this camera"} { "index": { "_id": 2 }} { "text": "Cameron International"} { "index": { "_id": 3 }} { "text": "canara bank"}
GET /blog/posts/_search { "query" : { "fuzzy" : { "text" : "camara" } } }
You will get
following results.
{ "took": 38, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 2, "max_score": 0.19178301, "hits": [ { "_index": "blog", "_type": "posts", "_id": "3", "_score": 0.19178301, "_source": { "text": "canara bank" } }, { "_index": "blog", "_type": "posts", "_id": "1", "_score": 0.15342641, "_source": { "text": "I love this camera" } } ] } }
As you
observe search query, I searched for “camara”, “camara” can be changed to
“canara” in single edit and “camara” can be changed to “camera” in single edit.
So documents 1 and 3 written in response.
You can also
add advanced options like below.
GET /blog/posts/_search { "query" : { "fuzzy" : { "text" :{ "value" : "aara", "boost" : 1.0, "fuzziness" : 2, "prefix_length" : 0, "max_expansions": 100 } } } }
fuzziness : The maximum edit distance. Defaults to AUTO.
prefix_length : The number of initial characters which will not be
“fuzzified”. This helps to reduce the number of terms which must be examined.
Defaults to 0.
max_expansions : The maximum number of terms that the fuzzy query
will expand to. Defaults to 50.
Above query
returns following result.
{ "took": 2, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 0.19178301, "hits": [ { "_index": "blog", "_type": "posts", "_id": "3", "_score": 0.19178301, "_source": { "text": "canara bank" } } ] } }
You can
apply fuzzy logic to numeric fields also.
{ "fuzzy" : { "price" : { "value" : 12, "fuzziness" : 2 } } }
Above query
matches price from 10(12-2) to 14(12+2).
No comments:
Post a Comment