How does the edge ngram token filter differ from the ngram token filter?
ngram moves the cursor while breaking the text:
Text: Red Wine
Options:
min_gram: 2
max_gram: 3
Result: Re, Red, ed, Wi, Win, in, ine, ne
As you can see, at each position the cursor emits every fragment from min_gram up to max_gram characters long, then moves one step forward and repeats until it reaches the end of each token.
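If it helps to picture the cursor, here is a rough JavaScript sketch of that behaviour (purely illustrative: the ngrams helper is made up and is not how Elasticsearch is actually implemented):
function ngrams(token, minGram, maxGram) {
  const result = [];
  // the cursor walks over every position in the token...
  for (let start = 0; start < token.length; start++) {
    // ...and at each position emits fragments of minGram up to maxGram characters
    for (let len = minGram; len <= maxGram && start + len <= token.length; len++) {
      result.push(token.substring(start, start + len));
    }
  }
  return result;
}
// split on non-letters/digits first, like token_chars: ["letter", "digit"]
console.log("Red Wine".split(/[^a-zA-Z0-9]+/).flatMap(t => ngrams(t, 2, 3)));
// -> ["Re", "Red", "ed", "Wi", "Win", "in", "ine", "ne"]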
edge_ngram does the exact same thing as ngram, but it doesn't move the cursor:
Text: Red Wine
Options:
min_gram: 2
max_gram: 3
Result: Re, Red
Why didn't it return Win? Because the cursor doesn't move: it always starts from position zero, emits fragments from min_gram up to max_gram characters long, and stays anchored at that same starting position (which is always zero).
Think of edge_ngram as if it were a substring function in other programming languages, such as JavaScript:
// ngram
let str = "Red Wine";
console.log(str.substring(0, 2)); // Re
console.log(str.substring(0, 3)); // Red
console.log(str.substring(1, 3)); // ed, start from position 1
// ...
// edge_ngram
// notice that the position is always zero
console.log(str.substring(0, 2)); // Re
console.log(str.substring(0, 3)); // Red
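To complete the analogy, here is a tiny sketch of the edge variant (again just an illustration; edgeNgrams is an invented helper, not an Elasticsearch API):
function edgeNgrams(token, minGram, maxGram) {
  const result = [];
  // the start position is pinned at zero, only the length grows
  for (let len = minGram; len <= maxGram && len <= token.length; len++) {
    result.push(token.substring(0, len));
  }
  return result;
}
console.log(edgeNgrams("Red Wine", 2, 3)); // -> ["Re", "Red"]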
Try it out for yourself using Kibana:
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "my_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "tokenizer": "my_ngram_tokenizer",
  "text": "Red Wine"
}

POST my_index/_analyze
{
  "tokenizer": "my_edge_ngram_tokenizer",
  "text": "Red Wine"
}
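If everything is set up as above, the first _analyze call should return the fragments listed earlier (Re, Red, ed, Wi, Win, in, ine, ne), while the second should return only Re and Red: my_edge_ngram_tokenizer has no token_chars configured, so the whole string "Red Wine" is treated as a single token and the grams are anchored at its very first character.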
I think the documentation is pretty clear on this:
This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token.
And the best example for the nGram tokenizer again comes from the documentation:
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
With this tokenizer definition:
"type" : "nGram",
"min_gram" : "2",
"max_gram" : "3",
"token_chars": [ "letter", "digit" ]
In short:
- the tokenizer, depending on the configuration, will create tokens. In this example: FC, Schalke, 04.
- nGram generates groups of characters of minimum min_gram size and maximum max_gram size from an input text. Basically, the tokens are split into small chunks and each chunk is anchored on a character (it doesn't matter where this character is, all of them will create chunks).
- edgeNGram does the same, but the chunks always start from the beginning of each token. Basically, the chunks are anchored at the beginning of the tokens.
For the same text, an edgeNGram tokenizer with min_gram 2 and a larger max_gram (5) generates this: FC, Sc, Sch, Scha, Schal, 04. Every "word" in the text is considered, and for every "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04).
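As a quick sanity check, that anchoring can be mimicked in plain JavaScript (an illustration only; the edgeNgrams helper is invented, and min_gram/max_gram are assumed to be 2 and 5 here):
const edgeNgrams = (token, minGram, maxGram) => {
  const result = [];
  for (let len = minGram; len <= maxGram && len <= token.length; len++) {
    result.push(token.substring(0, len)); // always anchored at the start of the token
  }
  return result;
};
// split on non-letters/digits, like token_chars: ["letter", "digit"]
console.log("FC Schalke 04".split(/[^a-zA-Z0-9]+/).flatMap(t => edgeNgrams(t, 2, 5)));
// -> ["FC", "Sc", "Sch", "Scha", "Schal", "04"]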