Saturday 21 November 2015

Elasticsearch: Deal with typos and misspellings

Let’s say you want to build a search engine for your “e-commerce” site. You provide a search box, and ask users search for the products in the box. One problem here is, you can’t expect users to type search string correctly (I mean without spelling mistakes).

User can type “wasing machine” instead of “washing machine”, “camra” instead of “camera”. So it is your responsibility to deal with these kinds of typos. In this post and subsequent posts, I will explain how to deal with these typos.

Levenshtein distance
Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.

For Example,
To change word “rat” to “cat” we need to substitute ‘r’ by ‘c’, so Levenshtein distance is 1.

Let’s change the string “kitten” to “sitting”
1.   “kitten” -------à “sitten” (Replace ‘k’ by ‘s’)
2.   “sitten” -------à “sittin” (Replace ‘e’ by ‘i')
3.   “sittin” ---------> “sitting” (Insertion of ‘g’ at end)

Levenshtein distance to change word “kitten” to “sitting” is 3.

Fuzzy Query : support for misspellings

Elastic search solves misspellings problem by using fuzzy query.

PUT /blog/posts/_bulk
{ "index": { "_id": 1 }}
{ "text": "I love this camera"}
{ "index": { "_id": 2 }}
{ "text": "Cameron International"}
{ "index": { "_id": 3 }}
{ "text": "canara bank"}


GET /blog/posts/_search
{
 "query" : {
  "fuzzy" : {
   "text" : "camara"
  }
 }
}


You will get following results.

{
   "took": 38,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.19178301,
      "hits": [
         {
            "_index": "blog",
            "_type": "posts",
            "_id": "3",
            "_score": 0.19178301,
            "_source": {
               "text": "canara bank"
            }
         },
         {
            "_index": "blog",
            "_type": "posts",
            "_id": "1",
            "_score": 0.15342641,
            "_source": {
               "text": "I love this camera"
            }
         }
      ]
   }
}


As you observe search query, I searched for “camara”. “camara” can be changed to “canara” in single edit and “camara” can be changed to “camera” in single edit. So documents 1 and 3 written in response.




Prevoius                                                 Next                                                 Home

No comments:

Post a Comment