Wednesday 18 November 2015

Elasticsearch: Dealing with human languages

When dealing with human languages, you have to account for synonyms, stop words, spelling mistakes, and so on.

Language Analyzer
Elasticsearch provides a set of language analyzers, each tailored to one particular language.

Some of the language analyzers that ship with Elasticsearch are:
arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.

What does a language analyzer offer?
Language analyzers typically perform the following steps.

1. Tokenizing words
For example, the sentence “Experience life in all possible ways” is tokenized as follows.

[“Experience”, “life”, “in”, “all”, “possible”, “ways”]

2. Lower case all tokens
The output of step 1 is converted as follows.

[“experience”, “life”, “in”, “all”, “possible”, “ways”]

3. Remove stop words
“in” is a stop word. After removing it, the result looks like this.
[“experience”, “life”, “all”, “possible”, “ways”]

4. Stem tokens to their root form.
“experience” => “experi”
“possible”   => “possibl”
“ways”       => “wai”
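
These four steps correspond to a tokenizer followed by a chain of token filters. As a rough sketch, an analyzer close to the built-in english one can be assembled by hand in the index settings. The index name “analyzer_demo”, the analyzer name “rebuilt_english” and the filter names are placeholders chosen for this example, and the built-in analyzer may apply additional filters depending on the Elasticsearch version.

```json
# Sketch only: index, analyzer and filter names below are hypothetical.
PUT /analyzer_demo
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
```

The “standard” tokenizer covers step 1, “lowercase” step 2, the stop filter step 3 and the stemmer filter step 4. The possessive stemmer is an extra filter that strips a trailing 's from tokens before lowercasing.
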
Let’s create an index “blog” with a type “posts” whose “title” field uses the english analyzer, and then analyze the example sentence against that field.

PUT /blog
{
  "mappings": {
    "posts": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "english" 
        }
      }
    }
  }
}
POST /blog/_analyze?field=title
{
"Experience life in all possible ways"
}

You will get the following response. Note that position 3 is missing: it belonged to the stop word “in”, which was removed.
{
   "tokens": [
      {
         "token": "experi",
         "start_offset": 3,
         "end_offset": 13,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "life",
         "start_offset": 14,
         "end_offset": 18,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "all",
         "start_offset": 22,
         "end_offset": 25,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "possibl",
         "start_offset": 26,
         "end_offset": 34,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "wai",
         "start_offset": 35,
         "end_offset": 39,
         "type": "<ALPHANUM>",
         "position": 6
      }
   ]
}
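
Because the same analyzer is also applied to the query text of a match query, searching for a different form of a word should still find the post. A small sketch, assuming the “blog” index and mapping created earlier; the document ID 1 and the query word “Way” are arbitrary choices for illustration.

```json
# Sketch only: document ID 1 and the query word are arbitrary.
# Index one post, refresh so it becomes searchable, then query it.
PUT /blog/posts/1
{
  "title": "Experience life in all possible ways"
}

POST /blog/_refresh

GET /blog/posts/_search
{
  "query": {
    "match": {
      "title": "Way"
    }
  }
}
```

“Way” is lowercased and stemmed to “wai”, the same term that was stored for “ways”, so the post should match.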








