It is the default analyzer for analyzed string fields. It is built from the
Standard Tokenizer, the Lower Case Token Filter, and the Stop Token Filter.
Standard Tokenizer
The standard tokenizer implements the Unicode Text Segmentation algorithm, as
specified in Unicode Standard Annex #29. It works well for most European
languages, but it may not work well for Asian languages.
POST _analyze?tokenizer=standard
{
  "Age is an issue of mind over matter."
}
You will get the following response.
{
  "tokens": [
    { "token": "Age", "start_offset": 5, "end_offset": 8, "type": "<ALPHANUM>", "position": 1 },
    { "token": "is", "start_offset": 9, "end_offset": 11, "type": "<ALPHANUM>", "position": 2 },
    { "token": "an", "start_offset": 12, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 },
    { "token": "issue", "start_offset": 15, "end_offset": 20, "type": "<ALPHANUM>", "position": 4 },
    { "token": "of", "start_offset": 21, "end_offset": 23, "type": "<ALPHANUM>", "position": 5 },
    { "token": "mind", "start_offset": 24, "end_offset": 28, "type": "<ALPHANUM>", "position": 6 },
    { "token": "over", "start_offset": 29, "end_offset": 33, "type": "<ALPHANUM>", "position": 7 },
    { "token": "matter", "start_offset": 34, "end_offset": 40, "type": "<ALPHANUM>", "position": 8 }
  ]
}
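To make the fields in that response more concrete, here is a minimal Python sketch of a tokenizer that emits the same kind of records. It is only a rough stand-in: it splits on runs of word characters with a regular expression instead of implementing the full Unicode Text Segmentation rules, and the function name `standard_like_tokenize` is made up for this illustration. Offsets are counted from the start of the bare string, so they need not match the offsets in the response above.

```python
import re

def standard_like_tokenize(text):
    """Rough approximation of a word tokenizer (NOT the real UAX #29 algorithm)."""
    tokens = []
    # Each run of word characters becomes one token; positions start at 1
    # to mirror the response shown above.
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        tokens.append({
            "token": match.group(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just past the last character
            "position": position,
        })
    return tokens

for tok in standard_like_tokenize("Age is an issue of mind over matter."):
    print(tok)
```

Note that `end_offset - start_offset` always equals the length of the token, and that the trailing period produces no token at all.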
Lower Case Token Filter
The lowercase token filter, one of the most frequently used token filters,
transforms each token into its lowercase form.
Without the lowercase filter:
POST _analyze?tokenizer=standard
{
  "AGE IS AN ISSUE OF MIND OVER MATTER"
}
You will get the following response.
{
  "tokens": [
    { "token": "AGE", "start_offset": 5, "end_offset": 8, "type": "<ALPHANUM>", "position": 1 },
    { "token": "IS", "start_offset": 9, "end_offset": 11, "type": "<ALPHANUM>", "position": 2 },
    { "token": "AN", "start_offset": 12, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 },
    { "token": "ISSUE", "start_offset": 15, "end_offset": 20, "type": "<ALPHANUM>", "position": 4 },
    { "token": "OF", "start_offset": 21, "end_offset": 23, "type": "<ALPHANUM>", "position": 5 },
    { "token": "MIND", "start_offset": 24, "end_offset": 28, "type": "<ALPHANUM>", "position": 6 },
    { "token": "OVER", "start_offset": 29, "end_offset": 33, "type": "<ALPHANUM>", "position": 7 },
    { "token": "MATTER", "start_offset": 34, "end_offset": 40, "type": "<ALPHANUM>", "position": 8 }
  ]
}
With the lowercase filter:
POST _analyze?tokenizer=standard&filters=lowercase
{
  "AGE IS AN ISSUE OF MIND OVER MATTER"
}
You will get the following response.
{
  "tokens": [
    { "token": "age", "start_offset": 5, "end_offset": 8, "type": "<ALPHANUM>", "position": 1 },
    { "token": "is", "start_offset": 9, "end_offset": 11, "type": "<ALPHANUM>", "position": 2 },
    { "token": "an", "start_offset": 12, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 },
    { "token": "issue", "start_offset": 15, "end_offset": 20, "type": "<ALPHANUM>", "position": 4 },
    { "token": "of", "start_offset": 21, "end_offset": 23, "type": "<ALPHANUM>", "position": 5 },
    { "token": "mind", "start_offset": 24, "end_offset": 28, "type": "<ALPHANUM>", "position": 6 },
    { "token": "over", "start_offset": 29, "end_offset": 33, "type": "<ALPHANUM>", "position": 7 },
    { "token": "matter", "start_offset": 34, "end_offset": 40, "type": "<ALPHANUM>", "position": 8 }
  ]
}
Stop Token Filter
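A token filter of type ‘stop’ removes stop words from the token stream. For
example, the following index settings define a stop filter named my_stop with
a custom list of stop words.
PUT /blog
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["is", "of", "an", "over"]
        }
      }
    }
  }
}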
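As an illustration of what such a filter does inside an analysis chain, the following Python sketch chains a simple tokenizer, a lowercase step, and a stop step that uses the same stop word list as my_stop. It is a simplified model, not Elasticsearch code: the regular-expression tokenizer is an assumption standing in for the standard tokenizer, and the function names are invented for this example.

```python
import re

# Stop word list copied from the my_stop filter definition above.
MY_STOPWORDS = {"is", "of", "an", "over"}

def tokenize(text):
    """Simplified tokenizer: every run of word characters is a token."""
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    """Transform each token into its lowercase form."""
    return [token.lower() for token in tokens]

def stop_filter(tokens, stopwords=MY_STOPWORDS):
    """Drop every token that appears in the stop word list (exact match)."""
    return [token for token in tokens if token not in stopwords]

def analyze(text):
    """Tokenizer first, then the filters in order: lowercase, then stop."""
    return stop_filter(lowercase_filter(tokenize(text)))

print(analyze("Age is an issue of mind over matter."))
# prints ['age', 'issue', 'mind', 'matter']
```

The order of the filters matters in this sketch: the stop step compares tokens exactly, so running it before the lowercase step would let "IS" or "Of" slip through.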