Thursday 19 November 2015

Elasticsearch: Character filters

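Character filters pre-process the text before it is passed to the tokenizer.
For example, the following request analyzes a string containing HTML tags with the standard tokenizer and no character filter: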
GET _analyze?tokenizer=standard
{
  " <p> Elastic search is <h1>easy to learn. </h1> </p>"
}
The above query returns the following response.
{
   "tokens": [
      {
         "token": "p",
         "start_offset": 7,
         "end_offset": 8,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "Elastic",
         "start_offset": 10,
         "end_offset": 17,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "search",
         "start_offset": 18,
         "end_offset": 24,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "is",
         "start_offset": 25,
         "end_offset": 27,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "h1",
         "start_offset": 29,
         "end_offset": 31,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "easy",
         "start_offset": 32,
         "end_offset": 36,
         "type": "<ALPHANUM>",
         "position": 6
      },
      {
         "token": "to",
         "start_offset": 37,
         "end_offset": 39,
         "type": "<ALPHANUM>",
         "position": 7
      },
      {
         "token": "learn",
         "start_offset": 40,
         "end_offset": 45,
         "type": "<ALPHANUM>",
         "position": 8
      },
      {
         "token": "h1",
         "start_offset": 49,
         "end_offset": 51,
         "type": "<ALPHANUM>",
         "position": 9
      },
      {
         "token": "p",
         "start_offset": 55,
         "end_offset": 56,
         "type": "<ALPHANUM>",
         "position": 10
      }
   ]
}
As the response shows, the tags <p> and <h1> are treated as tokens ("p" and "h1").
To avoid this, the text has to be pre-processed before it is tokenized, and this is where character filters come in.

GET _analyze?tokenizer=standard&char_filters=html_strip
{
  " <p> Elastic search is <h1>easy to learn. </h1> </p>"
}
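The html_strip character filter removes HTML elements from the text being analyzed. The above query returns the following response. The tag tokens are gone, while the offsets of the remaining tokens still refer to positions in the original text.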
{
   "tokens": [
      {
         "token": "Elastic",
         "start_offset": 10,
         "end_offset": 17,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "search",
         "start_offset": 18,
         "end_offset": 24,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "is",
         "start_offset": 25,
         "end_offset": 27,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "easy",
         "start_offset": 32,
         "end_offset": 36,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "to",
         "start_offset": 37,
         "end_offset": 39,
         "type": "<ALPHANUM>",
         "position": 5
      },
      {
         "token": "learn",
         "start_offset": 40,
         "end_offset": 45,
         "type": "<ALPHANUM>",
         "position": 6
      }
   ]
}
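
To apply the same filter at index time rather than only in an ad hoc _analyze call, html_strip can be listed in the char_filter array of a custom analyzer. The settings below are a minimal sketch of that idea and are not taken from the requests above: the index name my_index and the analyzer name html_text are hypothetical, and the body is assumed to be sent with PUT /my_index when the index is created.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_text": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard"
        }
      }
    }
  }
}
```

Assuming the index was created this way, a request such as GET /my_index/_analyze?analyzer=html_text with the same text as above should return the same tokens as the second response, since the analyzer combines the same character filter and tokenizer.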



