In computing, stop words are words that are filtered out before or after natural language data is processed. In any language, some words occur very frequently yet contribute very little to searching and to judging relevance.
For example, words such as and, or, was, is, this, that are common in English and occur frequently.
Below are the English stop words used by Elasticsearch.
a, an, and, are, as, at, be, but, by, for, if, in,
into, is, it, no, not, of, on, or, such, that, the, their, then, there, these,
they, this, to, was, will, with
These stop words are filtered out at index time.
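To make the idea concrete, here is a minimal Python sketch of what stop word filtering amounts to. It is only an illustration, not how Elasticsearch implements it: the names ENGLISH_STOP_WORDS and remove_stop_words are hypothetical, and tokenization is simplified to lowercasing and splitting on whitespace.

```python
# The English stop word list quoted above.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
}


def remove_stop_words(text, stop_words=ENGLISH_STOP_WORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in stop_words]


print(remove_stop_words("Hari and PTR are friends"))
# prints ['hari', 'ptr', 'friends']
```

Only the words that carry meaning for a search survive; ‘and’ and ‘are’ are discarded.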
How to specify stop words
Stop words can be specified in either of two ways:
1. Using a stop token filter
2. Specifying them while creating a custom analyzer
Stop token filter
The stop token filter removes stop words from a token stream.
The snippet below creates a custom token filter, “my_stop”, and an analyzer, “custom_analyzer”, that uses it.
PUT /blog
{
  "settings": {
    "analysis": {
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["and", "is", "the"]
        }
      },
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      }
    }
  }
}
POST /blog/_analyze?analyzer=custom_analyzer {"PTR and krishna are friends"}
Since ‘and’ is in the stop word list, it is removed from the response. ‘are’ is kept because the custom list contains only ‘and’, ‘is’ and ‘the’.
{
  "tokens": [
    { "token": "ptr", "start_offset": 2, "end_offset": 5, "type": "<ALPHANUM>", "position": 1 },
    { "token": "krishna", "start_offset": 10, "end_offset": 17, "type": "<ALPHANUM>", "position": 3 },
    { "token": "are", "start_offset": 18, "end_offset": 21, "type": "<ALPHANUM>", "position": 4 },
    { "token": "friends", "start_offset": 22, "end_offset": 29, "type": "<ALPHANUM>", "position": 5 }
  ]
}
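The gap in the position values (1, 3, 4, 5) is where ‘and’ used to be: the filter drops the token but leaves the positions of the others unchanged. The following Python sketch mimics that behaviour. It is an approximation under stated assumptions, not Elasticsearch code: the function name analyze is hypothetical, the standard tokenizer is approximated with a \w+ regular expression, and positions are numbered from 1 to match the response above. It also assumes the whole request body, including the leading {", was analyzed as raw text, which is what the offsets in the response suggest.

```python
import re


def analyze(text, stop_words):
    """Tokenize on word characters, lowercase, then drop stop words
    while keeping each surviving token's original position and offsets."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        token = match.group().lower()
        if token in stop_words:
            continue  # dropped, but the position counter still advances
        tokens.append({
            "token": token,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


# Assumption: the raw request body is what gets analyzed.
body = '{"PTR and krishna are friends"}'
for tok in analyze(body, {"and", "is", "the"}):
    print(tok)
```

Keeping the original positions matters for phrase queries, which rely on the distance between tokens.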
Specifying stop words
Delete the “blog” index if it already exists, then create it again with an analyzer that takes the stop words directly.
PUT /blog
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_english_analyzer": {
          "type": "english",
          "stopwords": ["and", "is", "the"]
        }
      }
    }
  }
}
POST /blog/_analyze?analyzer=custom_english_analyzer {"hari and ptr are friends"}
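Since ‘and’ is a stop word, it does not appear in the result. The english analyzer also stems the remaining tokens, which is why ‘are’ becomes ‘ar’ and ‘friends’ becomes ‘friend’.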
{
  "tokens": [
    { "token": "hari", "start_offset": 2, "end_offset": 6, "type": "<ALPHANUM>", "position": 1 },
    { "token": "ptr", "start_offset": 11, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 },
    { "token": "ar", "start_offset": 15, "end_offset": 18, "type": "<ALPHANUM>", "position": 4 },
    { "token": "friend", "start_offset": 19, "end_offset": 26, "type": "<ALPHANUM>", "position": 5 }
  ]
}
You can also specify the stop words of a specific language using the _lang_ notation, for example "stopwords" : "_english_".
PUT /blog
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_english_analyzer": {
          "type": "english",
          "stopwords": "_english_"
        }
      }
    }
  }
}
POST /blog/_analyze?analyzer=custom_english_analyzer {"Hari and PTR are friends"}
You will get the following response, in which both ‘and’ and ‘are’ are removed because both are in the English stop word list.
{
  "tokens": [
    { "token": "hari", "start_offset": 2, "end_offset": 6, "type": "<ALPHANUM>", "position": 1 },
    { "token": "ptr", "start_offset": 11, "end_offset": 14, "type": "<ALPHANUM>", "position": 3 },
    { "token": "friend", "start_offset": 19, "end_offset": 26, "type": "<ALPHANUM>", "position": 5 }
  ]
}