Programming for beginners: Elasticsearch: Standard Analyzer

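The standard analyzer is the default analyzer for analyzed string fields. It is built using the Standard Tokenizer, Lower Case Token Filter, and Stop Token Filter. Since Elasticsearch 1.0 the stop word list is empty by default, so no stop words are removed unless you configure them.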

Standard Tokenizer

The standard tokenizer implements the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. It works well for most European languages, but it may not work well for Asian languages.

POST _analyze?tokenizer=standard
{
  "Age is an issue of mind over matter."
}

You will get the following response.

{
   "tokens": [
      {
         "token": "Age",
         "start_offset": 5,
         "end_offset": 8,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "is",
         "start_offset": 9,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "an",
         "start_offset": 12,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "issue",
         "start_offset": 15,
         "end_offset": 20,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "of",
         "start_offset": 21,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "mind",
         "start_offset": 24,
         "end_offset": 28,
         "type": "<ALPHANUM>",
         "position": 6
      },
      {
         "token": "over",
         "start_offset": 29,
         "end_offset": 33,
         "type": "<ALPHANUM>",
         "position": 7
      },
      {
         "token": "matter",
         "start_offset": 34,
         "end_offset": 40,
         "type": "<ALPHANUM>",
         "position": 8
      }
   ]
}
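
To get a feel for how the offsets and positions in this response come about, here is a minimal Python sketch. It is only an approximation: it splits on runs of word characters with a regular expression rather than implementing the real Unicode Text Segmentation rules, and the function name standard_like_tokenize is made up for this example. It also assumes that the whole request body, including the brace, newline, indentation, and opening quote, was analyzed as raw text, which would explain why the first offset in the response is 5 and not 0.

```python
import re

def standard_like_tokenize(text):
    """Split text into word tokens, recording offsets and 1-based positions.

    Rough stand-in for the standard tokenizer: a run of word characters
    is treated as one token. This is NOT the UAX #29 algorithm.
    """
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

# Assumption: the request body is analyzed exactly as sent,
# including the brace, the newline, the indentation and the quotes.
raw_body = '{\n  "Age is an issue of mind over matter."\n}'
tokens = standard_like_tokenize(raw_body)
for t in tokens:
    print(t)
```

Under that assumption the sketch yields eight tokens, and the first one is "Age" with start_offset 5 and end_offset 8, which matches the response above.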

Lower Case Token Filter

The lowercase token filter is one of the most frequently used token filters. It transforms each token into its lowercase form.

Without the lowercase filter

POST _analyze?tokenizer=standard
{
  "AGE IS AN ISSUE OF MIND OVER MATTER"
}

You will get the following response.

{
   "tokens": [
      {
         "token": "AGE",
         "start_offset": 5,
         "end_offset": 8,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "IS",
         "start_offset": 9,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "AN",
         "start_offset": 12,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "ISSUE",
         "start_offset": 15,
         "end_offset": 20,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "OF",
         "start_offset": 21,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "MIND",
         "start_offset": 24,
         "end_offset": 28,
         "type": "<ALPHANUM>",
         "position": 6
      },
      {
         "token": "OVER",
         "start_offset": 29,
         "end_offset": 33,
         "type": "<ALPHANUM>",
         "position": 7
      },
      {
         "token": "MATTER",
         "start_offset": 34,
         "end_offset": 40,
         "type": "<ALPHANUM>",
         "position": 8
      }
   ]
}

With the lowercase filter

POST _analyze?tokenizer=standard&filters=lowercase
{
  "AGE IS AN ISSUE OF MIND OVER MATTER"
}

You will get the following response.

{
   "tokens": [
      {
         "token": "age",
         "start_offset": 5,
         "end_offset": 8,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "is",
         "start_offset": 9,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "an",
         "start_offset": 12,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "issue",
         "start_offset": 15,
         "end_offset": 20,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "of",
         "start_offset": 21,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "mind",
         "start_offset": 24,
         "end_offset": 28,
         "type": "<ALPHANUM>",
         "position": 6
      },
      {
         "token": "over",
         "start_offset": 29,
         "end_offset": 33,
         "type": "<ALPHANUM>",
         "position": 7
      },
      {
         "token": "matter",
         "start_offset": 34,
         "end_offset": 40,
         "type": "<ALPHANUM>",
         "position": 8
      }
   ]
}
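
The effect of the filter can be imitated in a few lines of Python. This is only an illustrative sketch, not how Elasticsearch implements it; lowercase_filter is a hypothetical helper name, and the two sample tokens are copied from the uppercase response above. The point it shows is that only the token text changes, while offsets and positions stay the same.

```python
def lowercase_filter(tokens):
    """Return new token dicts with lowercased text; offsets and positions are untouched."""
    return [{**t, "token": t["token"].lower()} for t in tokens]

# Two tokens as produced by the tokenizer alone (no filter applied yet).
upper_tokens = [
    {"token": "AGE", "start_offset": 5, "end_offset": 8, "position": 1},
    {"token": "IS", "start_offset": 9, "end_offset": 11, "position": 2},
]

lowered = lowercase_filter(upper_tokens)
print(lowered)
```

Because the filter runs after the tokenizer, it never sees the original text, only the tokens the tokenizer produced.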

Stop Token Filter

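A token filter of type ‘stop’ removes stop words from the token stream. The following request creates an index named blog and defines a custom stop filter, my_stop, with its own list of stop words.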

PUT /blog
{
  "settings":{
    "analysis":{
      "filter":{
        "my_stop":{
          "type":"stop",
          "stopwords":["is", "of", "an", "over"]
        }
      }
    }
  }
}
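
As a rough illustration of what such a filter does to a token stream, here is a small Python sketch. It is a simplification under stated assumptions: stop_filter is a made-up helper, the tokens are assumed to be lowercased already, and each surviving token simply keeps the position it had before filtering.

```python
def stop_filter(tokens, stopwords):
    """Drop every token whose text is in the stop word list (exact match)."""
    stop_set = set(stopwords)
    return [t for t in tokens if t["token"] not in stop_set]

# Tokens as they might look after tokenizing and lowercasing the sample sentence.
words = "age is an issue of mind over matter".split()
tokens = [{"token": w, "position": i} for i, w in enumerate(words, start=1)]

# Same stop word list as in the my_stop filter definition above.
kept = stop_filter(tokens, ["is", "of", "an", "over"])
print([(t["token"], t["position"]) for t in kept])
```

With the list from my_stop, only age, issue, mind, and matter remain. Matching in this sketch is exact, so an uppercase token such as "IS" would slip through unless the tokens are lowercased first.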


Thursday, 19 November 2015
