Thursday, 19 November 2015

Elasticsearch: ICU Tokenizer

The ICU tokenizer works like the standard tokenizer, but adds better support for some Asian languages. It is not bundled with Elasticsearch; it ships as a third-party plugin (elasticsearch-analysis-icu).

How to install the ICU Tokenizer
Step 1: Stop the running Elasticsearch instance.


Step 2: Go to the Elasticsearch installation directory and run the plugin install command as below.
$ plugin -install elasticsearch/elasticsearch-analysis-icu/2.7.0
-> Installing elasticsearch/elasticsearch-analysis-icu/2.7.0...
Trying http://download.elasticsearch.org/elasticsearch/elasticsearch-analysis-icu/elasticsearch-analysis-icu-2.7.0.zip...
Downloading .................................................. DONE
Installed elasticsearch/elasticsearch-analysis-icu/2.7.0 into /Users/harikrishna_gurram/softwares/elasticsearch-1.7.1/plugins/analysis-

Note:
You can get information about the latest version of the ICU tokenizer from the plugin's GitHub project page (elasticsearch-analysis-icu).

Step 3: Restart Elasticsearch.
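
To confirm that the plugin loaded after the restart, you can list the installed plugins with the cat API. The node name in the sample output below is just a placeholder, and the exact columns vary with the Elasticsearch version.

GET /_cat/plugins

node-1 analysis-icu 2.7.0 j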
Let’s see an example.

“PTR is my best friend” is written in Chinese as below.
PTR是我最好的朋友


Using the standard tokenizer:

GET /_analyze?tokenizer=standard
{
"PTR是我最好的朋友"
}


You will get the following response.

{
   "tokens": [
      {
         "token": "PTR",
         "start_offset": 3,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "是",
         "start_offset": 6,
         "end_offset": 7,
         "type": "<IDEOGRAPHIC>",
         "position": 2
      },
      {
         "token": "我",
         "start_offset": 7,
         "end_offset": 8,
         "type": "<IDEOGRAPHIC>",
         "position": 3
      },
      {
         "token": "最",
         "start_offset": 8,
         "end_offset": 9,
         "type": "<IDEOGRAPHIC>",
         "position": 4
      },
      {
         "token": "好",
         "start_offset": 9,
         "end_offset": 10,
         "type": "<IDEOGRAPHIC>",
         "position": 5
      },
      {
         "token": "的",
         "start_offset": 10,
         "end_offset": 11,
         "type": "<IDEOGRAPHIC>",
         "position": 6
      },
      {
         "token": "朋",
         "start_offset": 11,
         "end_offset": 12,
         "type": "<IDEOGRAPHIC>",
         "position": 7
      },
      {
         "token": "友",
         "start_offset": 12,
         "end_offset": 13,
         "type": "<IDEOGRAPHIC>",
         "position": 8
      }
   ]
}


Using the ICU tokenizer:

GET /_analyze?tokenizer=icu_tokenizer
{
"PTR是我最好的朋友"
}


You will get the following response.

{
   "tokens": [
      {
         "token": "PTR",
         "start_offset": 3,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "是",
         "start_offset": 6,
         "end_offset": 7,
         "type": "<IDEOGRAPHIC>",
         "position": 2
      },
      {
         "token": "我",
         "start_offset": 7,
         "end_offset": 8,
         "type": "<IDEOGRAPHIC>",
         "position": 3
      },
      {
         "token": "最好的",
         "start_offset": 8,
         "end_offset": 11,
         "type": "<IDEOGRAPHIC>",
         "position": 4
      },
      {
         "token": "朋友",
         "start_offset": 11,
         "end_offset": 13,
         "type": "<IDEOGRAPHIC>",
         "position": 5
      }
   ]
}


Notice the difference: the standard tokenizer emits every Chinese character as a separate token, while the ICU tokenizer uses a dictionary-based approach to identify whole words (最好的, 朋友) in some Asian languages.
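
In practice you wire the ICU tokenizer into an analyzer through the index settings rather than calling _analyze directly. Below is a minimal sketch for Elasticsearch 1.x; the index name my_index, the analyzer name my_icu_analyzer, the type my_type, and the field content are hypothetical names used only for illustration.

PUT /my_index
{
   "settings": {
      "analysis": {
         "analyzer": {
            "my_icu_analyzer": {
               "type": "custom",
               "tokenizer": "icu_tokenizer"
            }
         }
      }
   },
   "mappings": {
      "my_type": {
         "properties": {
            "content": {
               "type": "string",
               "analyzer": "my_icu_analyzer"
            }
         }
      }
   }
}

With this mapping, text indexed into the content field is segmented into words such as 朋友 instead of single characters, so a search for 朋友 matches documents that contain that word.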



