Thursday, 19 November 2015

Elasticsearch: ICU Tokenizer

The ICU tokenizer works like the standard tokenizer, but adds better support for some Asian languages. It is not bundled with Elasticsearch; it ships as a third-party plugin (elasticsearch-analysis-icu).

How to install the ICU Tokenizer
Step 1: Stop the running Elasticsearch instance.


Step 2: Go to the Elasticsearch installation directory and run the plugin install command as below.
$ plugin -install elasticsearch/elasticsearch-analysis-icu/2.7.0
-> Installing elasticsearch/elasticsearch-analysis-icu/2.7.0...
Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-analysis-icu/elasticsearch-analysis-icu-2.7.0.zip...
Downloading .................................................. DONE
Installed elasticsearch/elasticsearch-analysis-icu/2.7.0 into /Users/harikrishna_gurram/softwares/elasticsearch-1.7.1/plugins/analysis-

Note:
You can get information about the latest version of the ICU tokenizer from the plugin's GitHub project page (elasticsearch-analysis-icu).

Step 3: Restart Elasticsearch.
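
To confirm that the plugin loaded after the restart, you can list the installed plugins with the cat API. The node name in the sample output below is just a placeholder, and the exact columns vary with the Elasticsearch version.

GET /_cat/plugins

node-1 analysis-icu 2.7.0 j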
Let’s see an example.

“PTR is my best friend” is written in Chinese as below.
PTR是我最好的朋友


Using the standard tokenizer:

GET /_analyze?tokenizer=standard
{
"PTR是我最好的朋友"
}


You will get the following response.

{
   "tokens": [
      {
         "token": "PTR",
         "start_offset": 3,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "是",
         "start_offset": 6,
         "end_offset": 7,
         "type": "<IDEOGRAPHIC>",
         "position": 2
      },
      {
         "token": "我",
         "start_offset": 7,
         "end_offset": 8,
         "type": "<IDEOGRAPHIC>",
         "position": 3
      },
      {
         "token": "最",
         "start_offset": 8,
         "end_offset": 9,
         "type": "<IDEOGRAPHIC>",
         "position": 4
      },
      {
         "token": "好",
         "start_offset": 9,
         "end_offset": 10,
         "type": "<IDEOGRAPHIC>",
         "position": 5
      },
      {
         "token": "的",
         "start_offset": 10,
         "end_offset": 11,
         "type": "<IDEOGRAPHIC>",
         "position": 6
      },
      {
         "token": "朋",
         "start_offset": 11,
         "end_offset": 12,
         "type": "<IDEOGRAPHIC>",
         "position": 7
      },
      {
         "token": "友",
         "start_offset": 12,
         "end_offset": 13,
         "type": "<IDEOGRAPHIC>",
         "position": 8
      }
   ]
}


Using the ICU tokenizer:

GET /_analyze?tokenizer=icu_tokenizer
{
"PTR是我最好的朋友"
}


You will get the following response.

{
   "tokens": [
      {
         "token": "PTR",
         "start_offset": 3,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "是",
         "start_offset": 6,
         "end_offset": 7,
         "type": "<IDEOGRAPHIC>",
         "position": 2
      },
      {
         "token": "我",
         "start_offset": 7,
         "end_offset": 8,
         "type": "<IDEOGRAPHIC>",
         "position": 3
      },
      {
         "token": "最好的",
         "start_offset": 8,
         "end_offset": 11,
         "type": "<IDEOGRAPHIC>",
         "position": 4
      },
      {
         "token": "朋友",
         "start_offset": 11,
         "end_offset": 13,
         "type": "<IDEOGRAPHIC>",
         "position": 5
      }
   ]
}


Notice the difference: the standard tokenizer emits every Chinese character as a separate token, while the ICU tokenizer uses a dictionary-based approach to identify whole words (最好的, 朋友) in some Asian languages.
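
In practice you wire the ICU tokenizer into an analyzer through the index settings rather than calling _analyze directly. Below is a minimal sketch for Elasticsearch 1.x; the index name my_index, the analyzer name my_icu_analyzer, the type my_type, and the field content are hypothetical names used only for illustration.

PUT /my_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_icu_analyzer": {
               "type": "custom",
               "tokenizer": "icu_tokenizer"
            }
         }
      }
   },
   "mappings": {
      "my_type": {
         "properties": {
            "content": {
               "type": "string",
               "analyzer": "my_icu_analyzer"
            }
         }
      }
   }
}

With this mapping, text indexed into the content field is segmented into words such as 朋友 instead of single characters, so a search for 朋友 matches documents that contain that word.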



