- A character filter adds, removes, or changes characters
- Analyzers contain zero or more character filters
- Character filters are applied in the order in which they are specified
- Example (html_strip character filter)
- Input: "I'm in a <em>good</em> mood - and I <strong>love</strong> coffee!"
- Output: "I'm in a good mood - and I love coffee!"
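We can try the character filter in isolation with the _analyze API. A minimal sketch (the keyword tokenizer is chosen here only so the whole string comes back as a single token; that choice is an assumption for illustration):

POST _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "keyword",
  "text": "I'm in a <em>good</em> mood - and I <strong>love</strong> coffee!"
}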
- An analyzer contains one tokenizer
- Tokenizes a string, i.e. splits it into tokens
- Characters may be stripped as part of tokenization
- Example
- Input: "I REALLY like beer!"
- Output: ["I", "REALLY", "like", "beer"]
- The tokenizer also records the character offsets for each token in the original string
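To see the tokenizer on its own, we can run it through the _analyze API; a minimal sketch using the built-in standard tokenizer:

POST _analyze
{
  "tokenizer": "standard",
  "text": "I REALLY like beer!"
}

The response lists each token along with its start_offset, end_offset, type, and position.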
- Token filters receive the output of the tokenizer as input (i.e. the tokens)
- A token filter can add, remove, or modify tokens
- An analyzer contains zero or more token filters
- Token filters are applied in the order in which they are specified
- Example (lowercase token filter)
- Input: ["I", "REALLY", "like", "beer"]
- Output: ["i", "really", "like", "beer"]
- Built-in analyzers, character filters, tokenizers, and token filters are available
- We can also build custom ones
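As a sketch of a custom analyzer combining the building blocks above (the index name my_index and analyzer name my_custom_analyzer are made up for illustration):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}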
- The standard analyzer is the default analyzer in Elasticsearch
- It is used for all text fields unless configured otherwise
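For example, an analyzer can be set explicitly on a text field in the mapping; a sketch with a hypothetical products index and description field (specifying "standard" here is redundant, since it is the default):

PUT products
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}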
Input: "I REALLY like beer!"
Output: "I REALLY like beer!"
- No character filter is used by default
- So the text is passed to the tokenizer as it is
Input: "I REALLY like beer!"
Output: ["I", "REALLY", "like", "beer"]
- The tokenizer splits the text into tokens according to the unicode segmentation algorithm
- Essentially it breaks sentences by whitespaces, hyphens etc
- In the process, it also throws away punctuations such as commas, periods, exclamation marks etc
Input: ["I", "REALLY", "like", "beer"]
Output: ["i", "really", "like", "beer"]
- The tokenizer splits the text into tokens according to the unicode segmentation algorithm
- Essentially it breaks sentences by whitespaces, hyphens etc
- In the process, it also throws away punctuations such as commas, periods, exclamation marks etc
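The _analyze API can show the output of each of these steps separately through its explain parameter; a minimal sketch:

POST _analyze
{
  "analyzer": "standard",
  "text": "I REALLY like beer!",
  "explain": true
}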
Similarly, here is how the standard analyzer handles a more complicated piece of text:
Request:
POST _analyze
{
"text": "2 guys walk into a bar, but the third... DUCKS! :-)"
}
Response:
{
  "tokens" : [
    {
      "token" : "2",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<NUM>",
      "position" : 0
    },
    {
      "token" : "guys",
      "start_offset" : 2,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "walk",
      "start_offset" : 7,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "into",
      "start_offset" : 12,
      "end_offset" : 16,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "a",
      "start_offset" : 17,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "bar",
      "start_offset" : 19,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "but",
      "start_offset" : 24,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "the",
      "start_offset" : 28,
      "end_offset" : 31,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "third",
      "start_offset" : 32,
      "end_offset" : 37,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "ducks",
      "start_offset" : 41,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 9
    }
  ]
}
Essentially, all of these queries are the same:
- POST _analyze { "text": "2 guys walk into a bar, but the third... DUCKS! :-)" }
- POST _analyze { "text": "2 guys walk into a bar, but the third... DUCKS! :-)", "analyzer": "standard" }
- POST _analyze { "text": "2 guys walk into a bar, but the third... DUCKS! :-)", "char_filter": [], "tokenizer": "standard", "filter": ["lowercase"] }
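As a final sketch, the _analyze API can also be pointed at a field of an existing index, in which case the analyzer configured for that field is used (the products index and description field are hypothetical):

POST products/_analyze
{
  "field": "description",
  "text": "2 guys walk into a bar, but the third... DUCKS! :-)"
}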