How does the edge ngram token filter differ from the ngram token filter?
ngram moves the cursor while breaking the text:
Text: Red Wine
Options:
min_gram: 2
max_gram: 3
Result: Re, Red, ed, Wi, Win, in, ine, ne
As you can see, at each position the cursor emits every fragment from min_gram up to max_gram characters long, then moves one step forward and repeats until it reaches the end of each token.
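If it helps to picture the cursor, here is a rough JavaScript sketch of that behaviour (purely illustrative: the ngrams helper is made up and is not how Elasticsearch is actually implemented):
function ngrams(token, minGram, maxGram) {
  const result = [];
  // the cursor walks over every position in the token...
  for (let start = 0; start < token.length; start++) {
    // ...and at each position emits fragments of minGram up to maxGram characters
    for (let len = minGram; len <= maxGram && start + len <= token.length; len++) {
      result.push(token.substring(start, start + len));
    }
  }
  return result;
}
// split on non-letters/digits first, like token_chars: ["letter", "digit"]
console.log("Red Wine".split(/[^a-zA-Z0-9]+/).flatMap(t => ngrams(t, 2, 3)));
// -> ["Re", "Red", "ed", "Wi", "Win", "in", "ine", "ne"]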
edge_ngram does the exact same thing as ngram, but it doesn't move the cursor:
Text: Red Wine
Options:
min_gram: 2
max_gram: 3
Result: Re, Red
Why didn't it return Win? Because the cursor doesn't move: it always starts from position zero, emits fragments from min_gram up to max_gram characters long, and stays anchored at that same starting position (which is always zero).
Think of edge_ngram as if it were a substring function in other programming languages, such as JavaScript:
// ngram
let str = "Red Wine";
console.log(str.substring(0, 2)); // Re
console.log(str.substring(0, 3)); // Red
console.log(str.substring(1, 3)); // ed, start from position 1
// ...
// edge_ngram
// notice that the position is always zero
console.log(str.substring(0, 2)); // Re
console.log(str.substring(0, 3)); // Red
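To complete the analogy, here is a tiny sketch of the edge variant (again just an illustration; edgeNgrams is an invented helper, not an Elasticsearch API):
function edgeNgrams(token, minGram, maxGram) {
  const result = [];
  // the start position is pinned at zero, only the length grows
  for (let len = minGram; len <= maxGram && len <= token.length; len++) {
    result.push(token.substring(0, len));
  }
  return result;
}
console.log(edgeNgrams("Red Wine", 2, 3)); // -> ["Re", "Red"]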
Try it out for yourself using Kibana:
PUT my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        },
        "my_edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      }
    }
  }
}
POST my_index/_analyze
{
  "tokenizer": "my_ngram_tokenizer",
  "text": "Red Wine"
}

POST my_index/_analyze
{
  "tokenizer": "my_edge_ngram_tokenizer",
  "text": "Red Wine"
}
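If everything is set up as above, the first _analyze call should return the fragments listed earlier (Re, Red, ed, Wi, Win, in, ine, ne), while the second should return only Re and Red: my_edge_ngram_tokenizer has no token_chars configured, so the whole string "Red Wine" is treated as a single token and the grams are anchored at its very first character.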
I think the documentation is pretty clear on this:
This tokenizer is very similar to nGram but only keeps n-grams which start at the beginning of a token.
And the best example for the nGram tokenizer again comes from the documentation:
curl 'localhost:9200/test/_analyze?pretty=1&analyzer=my_ngram_analyzer' -d 'FC Schalke 04'
# FC, Sc, Sch, ch, cha, ha, hal, al, alk, lk, lke, ke, 04
With this tokenizer definition:
"type" : "nGram",
"min_gram" : "2",
"max_gram" : "3",
"token_chars": [ "letter", "digit" ]
In short:
- the tokenizer, depending on the configuration, will create tokens. In this example: FC, Schalke, 04.
- nGram generates groups of characters of minimum min_gram size and maximum max_gram size from an input text. Basically, the tokens are split into small chunks and each chunk is anchored on a character (it doesn't matter where this character is, all of them will create chunks).
- edgeNGram does the same, but the chunks always start from the beginning of each token. Basically, the chunks are anchored at the beginning of the tokens.
For the same text, an edgeNGram tokenizer with min_gram 2 and a larger max_gram (5) generates this: FC, Sc, Sch, Scha, Schal, 04. Every "word" in the text is considered, and for every "word" the first character is the starting point (F from FC, S from Schalke and 0 from 04).
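As a quick sanity check, that anchoring can be mimicked in plain JavaScript (an illustration only; the edgeNgrams helper is invented, and min_gram/max_gram are assumed to be 2 and 5 here):
const edgeNgrams = (token, minGram, maxGram) => {
  const result = [];
  for (let len = minGram; len <= maxGram && len <= token.length; len++) {
    result.push(token.substring(0, len)); // always anchored at the start of the token
  }
  return result;
};
// split on non-letters/digits, like token_chars: ["letter", "digit"]
console.log("FC Schalke 04".split(/[^a-zA-Z0-9]+/).flatMap(t => edgeNgrams(t, 2, 5)));
// -> ["FC", "Sc", "Sch", "Scha", "Schal", "04"]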