Thursday 19 November 2015

Elasticsearch: Standard Analyzer

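The standard analyzer is the default analyzer for analyzed string fields. It is built using the Standard Tokenizer, the Lower Case Token Filter, and the Stop Token Filter. By default the stop word list is empty, so no words are removed unless you configure stopwords.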

Standard Tokenizer

The standard tokenizer splits text on word boundaries using the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. It works well for most European languages, but it may not work well for some Asian languages.

POST _analyze?tokenizer=standard
{
  "Age is an issue of mind over matter."
}


You will get the following response.
{
   "tokens": [
      {
         "token": "Age",
         "start_offset": 5,
         "end_offset": 8,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "is",
         "start_offset": 9,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "an",
         "start_offset": 12,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "issue",
         "start_offset": 15,
         "end_offset": 20,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "of",
         "start_offset": 21,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "mind",
         "start_offset": 24,
         "end_offset": 28,
         "type": "<ALPHANUM>",
         "position": 6
      },
      {
         "token": "over",
         "start_offset": 29,
         "end_offset": 33,
         "type": "<ALPHANUM>",
         "position": 7
      },
      {
         "token": "matter",
         "start_offset": 34,
         "end_offset": 40,
         "type": "<ALPHANUM>",
         "position": 8
      }
   ]
}
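
The offsets above start at 5 rather than 0, which is consistent with the whole request body (opening brace, newline, indentation, and quote included) being analyzed as text. The Python sketch below reproduces the tokens, offsets, and positions under that assumption. It uses a simple regular expression as a rough stand-in for the UAX #29 word-boundary rules, which is adequate for this sentence but is not the real tokenizer. The function name standard_tokenize is made up for illustration.

```python
import re

def standard_tokenize(text):
    """Rough stand-in for the standard tokenizer (illustration only)."""
    # Assumption: runs of word characters approximate UAX #29 word
    # boundaries for simple English text; real rules are more involved.
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "type": "<ALPHANUM>",
            "position": position,           # 1-based, as in the response above
        })
    return tokens

# The request body exactly as sent, including the braces and indentation.
body = '{\n  "Age is an issue of mind over matter."\n}'

for tok in standard_tokenize(body):
    print(tok["token"], tok["start_offset"], tok["end_offset"], tok["position"])
```

Each end_offset minus start_offset equals the token length, so "mind" spans 24 to 28 and "matter" spans 34 to 40.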


Lower Case Token Filter
The lower case token filter is one of the most frequently used token filters. It transforms each token into its lowercase form.

Without the lowercase filter

POST _analyze?tokenizer=standard
{
  "AGE IS AN ISSUE OF MIND OVER MATTER"
}


You will get the following response.

{
   "tokens": [
      {
         "token": "AGE",
         "start_offset": 5,
         "end_offset": 8,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "IS",
         "start_offset": 9,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "AN",
         "start_offset": 12,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "ISSUE",
         "start_offset": 15,
         "end_offset": 20,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "OF",
         "start_offset": 21,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "MIND",
         "start_offset": 24,
         "end_offset": 28,
         "type": "<ALPHANUM>",
         "position": 6
      },
      {
         "token": "OVER",
         "start_offset": 29,
         "end_offset": 33,
         "type": "<ALPHANUM>",
         "position": 7
      },
      {
         "token": "MATTER",
         "start_offset": 34,
         "end_offset": 40,
         "type": "<ALPHANUM>",
         "position": 8
      }
   ]
}


With the lowercase filter

POST _analyze?tokenizer=standard&filters=lowercase
{
  "AGE IS AN ISSUE OF MIND OVER MATTER"
}


You will get the following response.

{
   "tokens": [
      {
         "token": "age",
         "start_offset": 5,
         "end_offset": 8,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "is",
         "start_offset": 9,
         "end_offset": 11,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "an",
         "start_offset": 12,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "issue",
         "start_offset": 15,
         "end_offset": 20,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "of",
         "start_offset": 21,
         "end_offset": 23,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "mind",
         "start_offset": 24,
         "end_offset": 28,
         "type": "<ALPHANUM>",
         "position": 6
      },
      {
         "token": "over",
         "start_offset": 29,
         "end_offset": 33,
         "type": "<ALPHANUM>",
         "position": 7
      },
      {
         "token": "matter",
         "start_offset": 34,
         "end_offset": 40,
         "type": "<ALPHANUM>",
         "position": 8
      }
   ]
}
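
As a minimal sketch of what the lowercase filter does to a token stream, the Python snippet below lowercases only the token text and leaves offsets and positions untouched, which matches the two responses above. The function name lowercase_filter is hypothetical, and Python's str.lower is only an approximation of the filter's actual case mapping.

```python
def lowercase_filter(tokens):
    """Return new token dicts with lowercased text; other fields are copied as-is."""
    return [dict(tok, token=tok["token"].lower()) for tok in tokens]

# Two tokens copied from the tokenizer-only response above.
tokens = [
    {"token": "AGE", "start_offset": 5, "end_offset": 8, "position": 1},
    {"token": "IS", "start_offset": 9, "end_offset": 11, "position": 2},
]

for tok in lowercase_filter(tokens):
    print(tok)
```

The filter returns new dictionaries rather than modifying the input, so the original uppercase tokens remain available for comparison.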


Stop Token Filter
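A token filter of type ‘stop’ removes stop words from the token stream. The following request creates an index named blog with a custom stop filter, my_stop, that removes the words "is", "of", "an", and "over".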

PUT /blog
{
  "settings":{
    "analysis":{
      "filter":{
        "my_stop":{
          "type":"stop",
          "stopwords":["is", "of", "an", "over"]
        }
      }
    }
  }
}
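
To illustrate how the three pieces fit together, here is a small Python sketch that chains a tokenizer, a lowercase step, and a stop step using the same stop word list as my_stop. It is a simulation, not Elasticsearch code: the function names are hypothetical, the regular expression only approximates the standard tokenizer, and keeping the original positions of surviving tokens (which leaves gaps) is an assumption about how the stop filter reports positions.

```python
import re

MY_STOP = {"is", "of", "an", "over"}  # same list as the my_stop filter above

def tokenize(text):
    # Assumption: runs of word characters roughly approximate the standard tokenizer.
    return [{"token": m.group(), "position": i}
            for i, m in enumerate(re.finditer(r"\w+", text), start=1)]

def lowercase_filter(tokens):
    # Lowercase the token text only; positions are copied unchanged.
    return [dict(t, token=t["token"].lower()) for t in tokens]

def stop_filter(tokens, stopwords):
    # Drop stop words; surviving tokens keep their original positions.
    return [t for t in tokens if t["token"] not in stopwords]

def analyze(text):
    # Lowercasing runs first because the stop word match here is case-sensitive.
    return stop_filter(lowercase_filter(tokenize(text)), MY_STOP)

for tok in analyze("AGE IS AN ISSUE OF MIND OVER MATTER"):
    print(tok["position"], tok["token"])
```

For this input the sketch keeps "age", "issue", "mind", and "matter" at positions 1, 4, 6, and 8.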




