The ICU tokenizer works like the standard tokenizer but adds better support for some Asian languages. It is provided by a third-party plugin (the ICU analysis plugin).
How to install the ICU Tokenizer
Step 1: Stop the Elasticsearch instance.
Step 2: Go to the Elasticsearch installation directory and run the command below.
plugin -install elasticsearch/elasticsearch-analysis-icu/2.7.0
$ plugin -install elasticsearch/elasticsearch-analysis-icu/2.7.0
-> Installing elasticsearch/elasticsearch-analysis-icu/2.7.0...
Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-analysis-icu/elasticsearch-analysis-icu-2.7.0.zip...
Downloading .................................................. DONE
Installed elasticsearch/elasticsearch-analysis-icu/2.7.0 into /Users/harikrishna_gurram/softwares/elasticsearch-1.7.1/plugins/analysis-
Note: You can find information about the latest version of the ICU tokenizer on the elasticsearch-analysis-icu plugin project page.
Step 3: Restart Elasticsearch.
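After the restart, you can confirm that the plugin was loaded. Assuming a default local install listening on port 9200, the _cat plugins API is one way to check (a quick sketch, not part of the original steps).

$ curl 'localhost:9200/_cat/plugins?v'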
Let’s see an example. “PTR is my best friend” is written in Chinese as below.
PTR是我最好的朋友
Using the standard tokenizer:
GET /_analyze?tokenizer=standard
{
"PTR是我最好的朋友"
}
You will get the following response.
{
  "tokens": [
    {
      "token": "PTR",
      "start_offset": 3,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "是",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "我",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "最",
      "start_offset": 8,
      "end_offset": 9,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "好",
      "start_offset": 9,
      "end_offset": 10,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    },
    {
      "token": "的",
      "start_offset": 10,
      "end_offset": 11,
      "type": "<IDEOGRAPHIC>",
      "position": 6
    },
    {
      "token": "朋",
      "start_offset": 11,
      "end_offset": 12,
      "type": "<IDEOGRAPHIC>",
      "position": 7
    },
    {
      "token": "友",
      "start_offset": 12,
      "end_offset": 13,
      "type": "<IDEOGRAPHIC>",
      "position": 8
    }
  ]
}
Using the ICU tokenizer:
GET /_analyze?tokenizer=icu_tokenizer
{
"PTR是我最好的朋友"
}
You will get the following response.
{
  "tokens": [
    {
      "token": "PTR",
      "start_offset": 3,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "是",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<IDEOGRAPHIC>",
      "position": 2
    },
    {
      "token": "我",
      "start_offset": 7,
      "end_offset": 8,
      "type": "<IDEOGRAPHIC>",
      "position": 3
    },
    {
      "token": "最好的",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<IDEOGRAPHIC>",
      "position": 4
    },
    {
      "token": "朋友",
      "start_offset": 11,
      "end_offset": 13,
      "type": "<IDEOGRAPHIC>",
      "position": 5
    }
  ]
}
The ICU tokenizer uses a dictionary-based approach to identify words in some Asian languages. That is why 最好的 (“best”) and 朋友 (“friend”) come back as single multi-character tokens, whereas the standard tokenizer emitted one token per character.
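To use this tokenizer at index time, you can wire it into a custom analyzer in the index settings. The sketch below uses hypothetical names (my_index, article, body, my_icu_analyzer); only icu_tokenizer itself comes from the plugin.

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_icu_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "my_icu_analyzer"
        }
      }
    }
  }
}

Any text indexed into the body field is then tokenized with the dictionary-based rules shown above.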