The ICU tokenizer is similar to the standard tokenizer, but it adds better support for some Asian languages. It is distributed as a third-party plugin.
How to install ICU Tokenizer
Step 1: Stop the Elasticsearch instance.
Step 2: Go to the Elasticsearch installation directory and run the command below.
plugin -install elasticsearch/elasticsearch-analysis-icu/2.7.0
$ plugin -install elasticsearch/elasticsearch-analysis-icu/2.7.0
-> Installing elasticsearch/elasticsearch-analysis-icu/2.7.0...
Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-analysis-icu/elasticsearch-analysis-icu-2.7.0.zip...
Downloading ........................................ DONE
Installed elasticsearch/elasticsearch-analysis-icu/2.7.0 into /Users/harikrishna_gurram/softwares/elasticsearch-1.7.1/plugins/analysis-
Note: You can get information about the latest version of the ICU tokenizer from the plugin's project page.
Step 3: Restart Elasticsearch.
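After the restart, you can check that the plugin was picked up. As a sketch (assuming Elasticsearch is listening on localhost:9200), the _cat/plugins API lists the plugins loaded on each node:

curl 'localhost:9200/_cat/plugins?v'

The output should include an analysis-icu entry for your node; if it does not appear, recheck the plugins directory and the restart step.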
Let's see an example. "PTR is my best friend" is written in Chinese as below.
PTR是我最好的朋友
Using the standard tokenizer:
GET /_analyze?tokenizer=standard
{ "PTR是我最好的朋友" }
You will get the following response.
{
  "tokens": [
    { "token": "PTR", "start_offset": 3, "end_offset": 6, "type": "<ALPHANUM>", "position": 1 },
    { "token": "是", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "我", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "最", "start_offset": 8, "end_offset": 9, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "好", "start_offset": 9, "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "的", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "朋", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 7 },
    { "token": "友", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 8 }
  ]
}
Using the ICU tokenizer:
GET /_analyze?tokenizer=icu_tokenizer
{ "PTR是我最好的朋友" }
You will get the following response.
{
  "tokens": [
    { "token": "PTR", "start_offset": 3, "end_offset": 6, "type": "<ALPHANUM>", "position": 1 },
    { "token": "是", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "我", "start_offset": 7, "end_offset": 8, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "最好的", "start_offset": 8, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "朋友", "start_offset": 11, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 5 }
  ]
}
Notice that the standard tokenizer splits the Chinese text into single characters, while the ICU tokenizer produces words such as 最好的 and 朋友. The ICU tokenizer uses a dictionary-based approach to identify words in some Asian languages.
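To use the tokenizer on indexed fields rather than only through the _analyze API, you can reference it from a custom analyzer in the index settings. A minimal sketch, assuming a hypothetical index named my_index and analyzer named my_icu_analyzer:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_icu_analyzer": {
          "type": "custom",
          "tokenizer": "icu_tokenizer"
        }
      }
    }
  }
}

Fields mapped with my_icu_analyzer will then be tokenized using the ICU word-boundary rules at both index and search time.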