Programming for beginners: Elasticsearch: Stop words

In computing, stop words are words that are filtered out before or after natural language data is processed. Every language has words that occur very frequently but contribute little to search relevance.

For example, words like and, or, was, is, this, and that are common in English and occur in almost every piece of text.

Below are the default English stop words used by Elasticsearch.

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

When an analyzer that uses this list is applied to a field, these words are removed from the token stream at index time.
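The filtering step can be pictured with a small Python sketch. This is only an illustration of the idea, not how Elasticsearch implements it: the names ENGLISH_STOP_WORDS and remove_stop_words are made up for this example, and the regular expression is a crude stand-in for a real tokenizer.

```python
import re

# The default English stop words listed above.
ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def remove_stop_words(text, stop_words=ENGLISH_STOP_WORDS):
    """Lowercase the text, split it into simple alphanumeric tokens,
    and drop every token that appears in the stop word set."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


if __name__ == "__main__":
    print(remove_stop_words("This is the blog of Hari and PTR"))
```

Only the words that carry meaning for a search (here blog, hari and ptr) survive the filter.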

How to specify stop words

Stop words can be specified in two ways:

1. Using the stop token filter

2. Setting the stopwords parameter when configuring an analyzer

Stop token filter

The stop token filter removes stop words from a token stream.

The snippet below creates a custom token filter “my_stop” and an analyzer “custom_analyzer” that applies it after lowercasing.

PUT /blog
{
  "settings": {
    "analysis": {
      "filter":{
        "my_stop":{
          "type" : "stop",
          "stopwords": ["and", "is", "the"]
        }
      },
      "analyzer": {
        "custom_analyzer":{
          "tokenizer" : "standard",
          "filter":[
            "lowercase",
            "my_stop"]
        }
      }
    }
  }
}

POST /blog/_analyze?analyzer=custom_analyzer
{"PTR and krishna are friends"}

Since ‘and’ is in the my_stop list, it is removed from the response. The word ‘are’ is kept because it is not in that list.

{
   "tokens": [
      {
         "token": "ptr",
         "start_offset": 2,
         "end_offset": 5,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "krishna",
         "start_offset": 10,
         "end_offset": 17,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "are",
         "start_offset": 18,
         "end_offset": 21,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "friends",
         "start_offset": 22,
         "end_offset": 29,
         "type": "<ALPHANUM>",
         "position": 5
      }
   ]
}
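To see where the offsets and positions in this response come from, here is a rough Python imitation of the tokenizer, lowercase and stop filter chain. It is a sketch under assumptions: analyze is a hypothetical helper, the pattern \w+ only approximates the standard tokenizer, and positions are counted from 1 to match the response shown above.

```python
import re


def analyze(text, stop_words):
    """Rough stand-in for: standard tokenizer -> lowercase -> stop filter."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        term = match.group().lower()
        if term in stop_words:
            # The token is dropped, but its position is still used up,
            # which leaves a gap in the position sequence.
            continue
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


if __name__ == "__main__":
    # The request body exactly as it was posted, braces and quote included.
    body = '{"PTR and krishna are friends"}'
    for token in analyze(body, {"and", "is", "the"}):
        print(token)
```

The sketch gives the same offsets and positions as the response, which suggests two things: the whole request body, including the leading {", was analyzed as plain text (so ptr starts at offset 2), and the removed word ‘and’ still occupies position 2, so krishna is at position 3.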

Specifying stop words

Delete the index “blog” if it exists (DELETE /blog), then create it again with the settings below.

PUT blog
{
 "settings": {
  "analysis": {
   "analyzer" :{
    "custom_english_analyzer" : {
     "type" : "english",
     "stopwords" : ["and", "is", "the"]
    }
   } 
  }
 }
}

POST /blog/_analyze?analyzer=custom_english_analyzer
{"hari and ptr are friends"}

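Since ‘and’ is a stop word, it will not appear in the result. The english analyzer also stems the remaining tokens, which is why ‘are’ becomes ‘ar’ and ‘friends’ becomes ‘friend’.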

{
   "tokens": [
      {
         "token": "hari",
         "start_offset": 2,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "ptr",
         "start_offset": 11,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "ar",
         "start_offset": 15,
         "end_offset": 18,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "friend",
         "start_offset": 19,
         "end_offset": 26,
         "type": "<ALPHANUM>",
         "position": 5
      }
   ]
}

You can also use the predefined stop word list of a specific language with the _lang_ notation, for example "stopwords" : "_english_".

Delete the blog index again before creating it with the new settings.

PUT /blog
{
 "settings": {
  "analysis": {
   "analyzer" :{
    "custom_english_analyzer" : {
     "type" : "english",
     "stopwords" : "_english_"
    }
   } 
  }
 }
}

POST /blog/_analyze?analyzer=custom_english_analyzer
{"Hari and PTR are friends"}

You will get the following response. Both ‘and’ and ‘are’ are removed this time, because both are in the _english_ list.

{
   "tokens": [
      {
         "token": "hari",
         "start_offset": 2,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "ptr",
         "start_offset": 11,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "friend",
         "start_offset": 19,
         "end_offset": 26,
         "type": "<ALPHANUM>",
         "position": 5
      }
   ]
}
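The difference between the last two responses comes down to which stop word list is in effect. The Python sketch below models the stopwords setting as either an explicit list or a predefined name. The names PREDEFINED_LISTS and resolve_stopwords are hypothetical, only _english_ and _none_ are modelled, and stemming is left out.

```python
ENGLISH = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)

# Hypothetical lookup table for the _lang_ notation.
PREDEFINED_LISTS = {"_english_": ENGLISH, "_none_": frozenset()}


def resolve_stopwords(setting):
    """Turn a stopwords setting into a set of words.

    A string such as "_english_" selects a predefined list,
    anything else is treated as an explicit list of words."""
    if isinstance(setting, str):
        return PREDEFINED_LISTS[setting]
    return frozenset(setting)


if __name__ == "__main__":
    words = ["hari", "and", "ptr", "are", "friends"]
    for setting in (["and", "is", "the"], "_english_"):
        stop_words = resolve_stopwords(setting)
        print(setting, [word for word in words if word not in stop_words])
```

With the explicit list, ‘are’ passes through because it was never listed. With _english_ it is dropped along with ‘and’.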


Friday, 20 November 2015
