When dealing with human languages, you have to consider synonyms, stop words, spelling mistakes, and so on.
Language Analyzer
Elasticsearch provides various language analyzers, each tuned to analyze text in a particular language. Some of the language analyzers that ship with Elasticsearch are:
arabic, armenian, basque, brazilian, bulgarian,
catalan, chinese, cjk, czech, danish, dutch, english, finnish, french,
galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian,
norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish,
turkish, thai.
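You can try any of these analyzers directly through the _analyze API, without creating an index first. Here is a minimal sketch (the JSON request body assumes Elasticsearch 2.x or later; older versions pass the analyzer and text as query-string parameters instead):
POST /_analyze
{
  "analyzer": "french",
  "text": "la vie est belle"
}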
What does a language analyzer offer you?
All language analyzers provide the following functionality in common (a sketch that recreates this pipeline as a custom analyzer follows the list).
1. Tokenizing words
For example, the sentence "Experience life in all possible ways" is tokenized as below.
["Experience", "life", "in", "all", "possible", "ways"]
2. Lowercase all tokens
The output of step 1 will be converted as below.
["experience", "life", "in", "all", "possible", "ways"]
3. Remove stop words
"in" is a stop word; after removing it, the result looks like below.
["experience", "life", "all", "possible", "ways"]
4. Stem tokens to their root form.
"experience" ======> "experi"
"possible" ======> "possibl"
"ways" ======> "wai"
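To see how these steps map onto concrete building blocks, the sketch below recreates a simplified english-like analyzer from a standard tokenizer plus lowercase, stop word, and stemmer token filters. The index name "my_blog" and analyzer name "my_english" are only for illustration, and the built-in english analyzer additionally handles details such as possessive stemming:
PUT /my_blog
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop":    { "type": "stop",    "stopwords": "_english_" },
        "english_stemmer": { "type": "stemmer", "language": "english" }
      },
      "analyzer": {
        "my_english": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "english_stop", "english_stemmer" ]
        }
      }
    }
  }
}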
Now let's define an index "blog" with a type "posts" whose title field uses the english analyzer.
PUT /blog
{
  "mappings": {
    "posts": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "english"
        }
      }
    }
  }
}
POST /blog/_analyze
{
  "field": "title",
  "text": "Experience life in all possible ways"
}
You will get the following response. Notice that the stop word "in" is gone (which is why position 3 is skipped) and the remaining tokens are stemmed.
{ "tokens": [ { "token": "experi", "start_offset": 3, "end_offset": 13, "type": "<ALPHANUM>", "position": 1 }, { "token": "life", "start_offset": 14, "end_offset": 18, "type": "<ALPHANUM>", "position": 2 }, { "token": "all", "start_offset": 22, "end_offset": 25, "type": "<ALPHANUM>", "position": 4 }, { "token": "possibl", "start_offset": 26, "end_offset": 34, "type": "<ALPHANUM>", "position": 5 }, { "token": "wai", "start_offset": 35, "end_offset": 39, "type": "<ALPHANUM>", "position": 6 } ] }