Wednesday, 18 November 2015

Elasticsearch: Dealing with human languages

When dealing with human languages, you have to consider synonyms, stop words, spelling mistakes, and so on.

Language Analyzer
Elasticsearch provides a number of language analyzers, each designed to analyze text in a particular language.

Some of the language analyzers that ship with Elasticsearch are:
arabic, armenian, basque, brazilian, bulgarian, catalan, chinese, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.

What does a language analyzer offer you?
All language analyzers provide the following functionality in common.

1. Tokenizing words
For example, the sentence “Experience life in all possible ways” is tokenized as below.

[“Experience”, “life”, “in”, “all”, “possible”, “ways”]

2. Lower case all tokens
The output of step 1 is converted as below.

[“experience”, “life”, “in”, “all”, “possible”, “ways”]

3. Remove stop words
“in” is a stop word; after removing it, the result looks like below.
[“experience”, “life”, “all”, “possible”, “ways”]

4. Stem tokens to their root form
“experience” ======> “experi”
“possible” ======> “possibl”
“ways” ======> “wai”
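Taken together, the four steps can be sketched in plain Python. This is a rough approximation for illustration only: the real english analyzer uses Lucene's standard tokenizer, a much larger stop-word list, and a Porter-style stemmer, and the `toy_stem` rules below are hand-picked just to reproduce the examples above.

```python
# A toy imitation of what a language analyzer does, step by step.
STOP_WORDS = {"a", "an", "and", "in", "of", "the"}  # tiny subset for illustration


def toy_stem(token):
    # A few hand-picked Porter-style rules, just enough for this
    # sentence; NOT the real stemming algorithm.
    if token.endswith("ence"):
        return token[:-4]          # "experience" -> "experi"
    if token.endswith("ble"):
        return token[:-1]          # "possible"   -> "possibl"
    if token.endswith("ys"):
        return token[:-2] + "i"    # "ways"       -> "wai"
    return token


def analyze(text):
    tokens = text.split()                                 # 1. tokenize words
    tokens = [t.lower() for t in tokens]                  # 2. lowercase all tokens
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 3. remove stop words
    return [toy_stem(t) for t in tokens]                  # 4. stem to root form


print(analyze("Experience life in all possible ways"))
# -> ['experi', 'life', 'all', 'possibl', 'wai']
```

Note how the output matches the tokens Elasticsearch itself produces for this sentence, shown in the _analyze response later in this post.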
Let’s create an index “blog” with a type “posts”, applying the english analyzer to the title field.

PUT /blog
{
  "mappings": {
    "posts": {
      "properties": {
        "title": {
          "type":     "string",
          "analyzer": "english" 
        }
      }
    }
  }
}
POST /blog/_analyze?field=title
Experience life in all possible ways

You will get the following response.
{
   "tokens": [
      {
         "token": "experi",
         "start_offset": 3,
         "end_offset": 13,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "life",
         "start_offset": 14,
         "end_offset": 18,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "all",
         "start_offset": 22,
         "end_offset": 25,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "possibl",
         "start_offset": 26,
         "end_offset": 34,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "wai",
         "start_offset": 35,
         "end_offset": 39,
         "type": "<ALPHANUM>",
         "position": 6
      }
   ]
}
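The payoff comes at search time: the same analyzer is applied to the query string, so different surface forms of a word collapse to the same term in the inverted index, and a query for “way” can match a document containing “ways”. A minimal sketch of that idea (the stemming rules are invented for illustration; Elasticsearch actually uses a Porter-style stemmer):

```python
# Index-time and query-time text go through the same analysis, so
# different forms of a word collapse to one term in the index.
def stem(token):
    # Two hand-picked Porter-style rules, for illustration only.
    if token.endswith("ys"):
        return token[:-2] + "i"   # "ways" -> "wai"
    if token.endswith("y"):
        return token[:-1] + "i"   # "way"  -> "wai"
    return token


indexed_term = stem("ways")   # what gets stored in the inverted index
query_term = stem("way")      # what the query string becomes
print(indexed_term, query_term, indexed_term == query_term)
# -> wai wai True
```

Because both sides reduce to “wai”, the match succeeds even though the document and the query use different forms of the word.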








