Wednesday 18 November 2015

Elasticsearch: Fuzzy query

Let’s say you want to build a search engine for your “e-commerce” site. You provide a search box, and ask users to search for the products in the search box. One problem here is, you can’t expect users to type search string correctly (I mean without spelling mistakes).

User can type “wasing machine” instead of “washing machine”, “camra” instead of “camera”. So it is your responsibility to deal with these kinds of typos. In this post and subsequent posts, I will explain how to deal with these typos.

Levenshtein distance
Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

For Example,
To change word “rat” to “cat” we need to substitute ‘r’ by ‘c’, so Levenshtein distance is 1.

Let’s change the string “kitten” to “sitting”
1.   “kitten” -------à “sitten” (Replace ‘k’ by ‘s’)
2.   “sitten” -------à “sittin” (Replace ‘e’ by ‘i')
3.   “sittin” ---------> “sitting” (Insertion of ‘g’ at end)

Levenshtein distance to change word “kitten” to “sitting” is 3.

Fuzzy Query : support for misspellings
Elastic search solves misspellings problem by using fuzzy query. The fuzzy query uses similarity based on Levenshtein edit distance for string fields, and a +/- margin on numeric and date fields.

PUT /blog/posts/_bulk
{ "index": { "_id": 1 }}
{ "text": "I love this camera"}
{ "index": { "_id": 2 }}
{ "text": "Cameron International"}
{ "index": { "_id": 3 }}
{ "text": "canara bank"}

GET /blog/posts/_search
{
 "query" : {
  "fuzzy" : {
   "text" : "camara"
  }
 }
}


You will get following results.

{
   "took": 38,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.19178301,
      "hits": [
         {
            "_index": "blog",
            "_type": "posts",
            "_id": "3",
            "_score": 0.19178301,
            "_source": {
               "text": "canara bank"
            }
         },
         {
            "_index": "blog",
            "_type": "posts",
            "_id": "1",
            "_score": 0.15342641,
            "_source": {
               "text": "I love this camera"
            }
         }
      ]
   }
}


As you observe search query, I searched for “camara”, “camara” can be changed to “canara” in single edit and “camara” can be changed to “camera” in single edit. So documents 1 and 3 written in response.

You can also add advanced options like below.

GET /blog/posts/_search
{
 "query" : {
  "fuzzy" : {
   "text" :{
     "value" : "aara",
     "boost" : 1.0,
        "fuzziness" : 2,
        "prefix_length" : 0,
        "max_expansions": 100
   }
  }
 }
}


fuzziness : The maximum edit distance. Defaults to AUTO.

prefix_length : The number of initial characters which will not be “fuzzified”. This helps to reduce the number of terms which must be examined. Defaults to 0.

max_expansions : The maximum number of terms that the fuzzy query will expand to. Defaults to 50.

Above query returns following result.

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.19178301,
      "hits": [
         {
            "_index": "blog",
            "_type": "posts",
            "_id": "3",
            "_score": 0.19178301,
            "_source": {
               "text": "canara bank"
            }
         }
      ]
   }
}


You can apply fuzzy logic to numeric fields also.

{
    "fuzzy" : {
        "price" : {
            "value" : 12,
            "fuzziness" : 2
        }
    }
}


Above query matches price from 10(12-2) to 14(12+2).






Prevoius                                                 Next                                                 Home

No comments:

Post a Comment