Friday 20 November 2015

Elasticsearch: Stop words

In computing, stop words are words that are filtered out before or after natural language text is processed. In any language, some words occur very frequently but contribute little to searching and to judging relevance.

For example, words such as and, or, was, is, this, and that are common in English and occur very frequently.

Below are the English stop words used by Elasticsearch.

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

Analyzers that use this list, such as the english analyzer, remove these words at index time.
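
To make the idea concrete, here is a minimal Python sketch of stop word removal using the list above. It is only an illustration, not how Elasticsearch implements it: the function name remove_stop_words is hypothetical, and the text is simply lowercased and split on whitespace.

```python
# The 33 English stop words listed above.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def remove_stop_words(text, stop_words=ENGLISH_STOP_WORDS):
    """Lowercase the text, split it on whitespace, and drop stop words.

    Hypothetical helper for illustration only.
    """
    return [word for word in text.lower().split() if word not in stop_words]


print(remove_stop_words("This is the blog of Hari and PTR"))
# ['blog', 'hari', 'ptr']
```

Only the words that carry meaning for a search survive; the frequent filler words are dropped.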

How to specify stop words
Stop words can be specified in two ways:
         1. Using a stop token filter
         2. Specifying them while creating a custom analyzer

Stop token filter
The stop token filter removes stop words from a token stream.

The snippet below creates a custom token filter “my_stop” and uses it, after the lowercase filter, in the analyzer “custom_analyzer”.
PUT /blog
{
  "settings": {
    "analysis": {
      "filter":{
        "my_stop":{
          "type" : "stop",
          "stopwords": ["and", "is", "the"]
        }
      },
      "analyzer": {
        "custom_analyzer":{
          "tokenizer" : "standard",
          "filter":[
            "lowercase",
            "my_stop"]
        }
      }
    }
  }
}

POST /blog/_analyze?analyzer=custom_analyzer
{"PTR and krishna are friends"}


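Since ‘and’ is in the “my_stop” list, it is removed from the response. The jump in position from 1 to 3 marks where it was. The word ‘are’ is kept because it is not in the custom list.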
{
   "tokens": [
      {
         "token": "ptr",
         "start_offset": 2,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "krishna",
         "start_offset": 10,
         "end_offset": 17,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "are",
         "start_offset": 18,
         "end_offset": 21,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "friends",
         "start_offset": 22,
         "end_offset": 29,
         "type": "<ALPHANUM>",
         "position": 5
      }
   ]
}
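
The token and position values in this response can be reproduced with a small Python sketch. This is an approximation under stated assumptions, not Elasticsearch code: the hypothetical analyze function stands in for the standard tokenizer with a simple \w+ regular expression, lowercases each token, and numbers positions from 1 as in the response above. Character offsets are left out.

```python
import re


def analyze(text, stop_words):
    """Return (token, position) pairs, keeping position gaps for removed words.

    Hypothetical stand-in for: standard tokenizer -> lowercase -> stop filter.
    """
    result = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        token = match.group().lower()
        if token not in stop_words:
            # The position counter still advanced for removed words,
            # so a removed stop word leaves a gap.
            result.append((token, position))
    return result


print(analyze("PTR and krishna are friends", {"and", "is", "the"}))
# [('ptr', 1), ('krishna', 3), ('are', 4), ('friends', 5)]
```

Position 2 is missing because ‘and’ was dropped after it had been counted, which matches the gap in the response.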


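Specifying stop words while creating an analyzer
Delete the “blog” index if it exists, then recreate it with an english analyzer that uses a custom stop word list.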
PUT blog
{
 "settings": {
  "analysis": {
   "analyzer" :{
    "custom_english_analyzer" : {
     "type" : "english",
     "stopwords" : ["and", "is", "the"]
    }
   } 
  }
 }
}

POST /blog/_analyze?analyzer=custom_english_analyzer
{"hari and ptr are friends"}


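Since ‘and’ is a stop word, it does not appear in the result. The english analyzer also stems tokens, which is why ‘are’ appears as ‘ar’ and ‘friends’ as ‘friend’.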
{
   "tokens": [
      {
         "token": "hari",
         "start_offset": 2,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "ptr",
         "start_offset": 11,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "ar",
         "start_offset": 15,
         "end_offset": 18,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "friend",
         "start_offset": 19,
         "end_offset": 26,
         "type": "<ALPHANUM>",
         "position": 5
      }
   ]
}
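
If you create such indexes from a script, the settings body can be generated instead of typed by hand. The following Python sketch builds the same JSON structure as the request above. The function name english_analyzer_settings is hypothetical, and sending the body to Elasticsearch is deliberately left out.

```python
import json


def english_analyzer_settings(name, stopwords):
    """Build an index settings body for an "english" analyzer.

    `stopwords` is passed through unchanged, e.g. a list of words.
    Hypothetical helper for illustration only.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    name: {"type": "english", "stopwords": stopwords}
                }
            }
        }
    }


body = english_analyzer_settings("custom_english_analyzer", ["and", "is", "the"])
print(json.dumps(body, indent=2))
```

The printed JSON can be used as the body of the PUT request that creates the index.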


You can also specify the stop words of a specific language using the _lang_ notation, for example "stopwords" : "_english_".
Delete the blog index first.
PUT /blog
{
 "settings": {
  "analysis": {
   "analyzer" :{
    "custom_english_analyzer" : {
     "type" : "english",
     "stopwords" : "_english_"
    }
   } 
  }
 }
}

POST /blog/_analyze?analyzer=custom_english_analyzer
{"Hari and PTR are friends"}


You will get the following response. Both ‘and’ and ‘are’ are in the English stop word list, so neither appears.
{
   "tokens": [
      {
         "token": "hari",
         "start_offset": 2,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "ptr",
         "start_offset": 11,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "friend",
         "start_offset": 19,
         "end_offset": 26,
         "type": "<ALPHANUM>",
         "position": 5
      }
   ]
}
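
The difference between the custom list and the English list can be sketched in Python as follows. This is a simplified illustration: the names below are hypothetical, tokenizing is done with a \w+ regular expression, and no stemming is applied, so ‘friends’ stays as it is instead of becoming ‘friend’.

```python
import re

CUSTOM_STOP_WORDS = {"and", "is", "the"}

# The English stop words listed at the top of this post.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def surviving_tokens(text, stop_words):
    """Lowercase the text, split it into words, and drop the stop words."""
    return [t for t in re.findall(r"\w+", text.lower()) if t not in stop_words]


text = "Hari and PTR are friends"
print(surviving_tokens(text, CUSTOM_STOP_WORDS))
# ['hari', 'ptr', 'are', 'friends']
print(surviving_tokens(text, ENGLISH_STOP_WORDS))
# ['hari', 'ptr', 'friends']
```

With the custom list ‘are’ survives; with the English list it is removed along with ‘and’.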





