PostgreSQL, trigrams and similarity

The trigram algorithm should be the more accurate the less is the difference in length of compared strings. You can modify the algorithm to compensate the effect of length difference.

The following exemplary function reduces the similarity by 1% for the difference of 1 character in string lenghts. This means that it favors strings of the same (similar) length.

create or replace function corrected_similarity(str1 text, str2 text)
returns float4 language sql as $$
    select similarity(str1, str2)* (1- abs(length(str1)-length(str2))/100.0)::float4
$$;

select 
    winery, 
    similarity(winery, 'chateau chevla blanc') as similarity,
    corrected_similarity(winery, 'chateau chevla blanc') as corrected_similarity
from usr_wines 
where winery % 'chateau chevla blanc'  
order by corrected_similarity desc;

          winery          | similarity | corrected_similarity 
--------------------------+------------+----------------------
 Chateau ChevL Blanc      |       0.85 |               0.8415
 Chateau Cheval Blanc     |   0.727273 |             0.727273
 Chateau Cheval Blanc     |   0.727273 |             0.727273
 Chateau Cheval Blanc     |   0.727273 |             0.727273
 Chateau Blanc,           |   0.736842 |             0.692632
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Blanc            |   0.736842 |             0.685263
 Chateau Cheval Blanc (7) |   0.666667 |                 0.64
 Chateau Du Cheval Blanc  |       0.64 |               0.6208
 Chateau Du Cheval Blanc  |       0.64 |               0.6208
 Chateau Cheval Blanc Cbo |       0.64 |               0.6144
(14 rows)

In a similar way you can correct the standard similarity by, for example, how many initial characters are identical (thought the function will be a bit more complicated).


The concept of trigram similarity relies on having any sentence divided into "trigrams" (sequences of three consecutive letters), and treating the result as a SET (i.e.: the order doesn't matter, and you don't have repeated values). Before the sentence is considered, two blank spaces are added at the beginning, and one at the end, and single spaces are replaced by double ones.

Trigrams are a special case of N-grams.

The trigram set corresponding to "Chateau blanc" is found by finding all sequences of three letters that appear on it:

  chateau  blanc
---                 => '  c'
 ---                => ' ch'
  ---               => 'cha'
   ---              => 'hat'
    ---             => 'ate'
     ---            => 'tea'
      ---           => 'eau'
       ---          => 'au '
        ---         => 'u  '
         ---        => '  b'
          ---       => ' bl'
           ---      => 'bla'
            ---     => 'lan'
             ---    => 'anc'
              ---   => 'nc '

Sorting them, and taking out repetitions gets you:

'  b'
'  c'
' bl'
' ch'
'anc'
'ate'
'au '
'bla'
'cha'
'eau'
'hat'
'lan'
'nc '
'tea'

This can be computed by PostgreSQL by means of the function show_trgm:

SELECT show_trgm('Chateau blanc') AS A

A = [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

... which has 14 trigrams. (Check pg_trgm).

And the trigram set corresponding to "Chateau Cheval Blanc" is:

SELECT show_trgm('Chateau Cheval Blanc') AS B 

B = [  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]

... which has 19 trigrams

If you count how many trigrams have both sets in common, you find that they have the following ones:

A intersect B = 
    [  b,  c, bl, ch,anc,ate,au ,bla,cha,eau,hat,lan,nc ,tea]

and the ones they have in total are:

A union B = 
    [  b,  c, bl, ch,anc,ate,au ,bla,cha,che,eau,evl,hat,hev,la ,lan,nc ,tea,vla]

That is, both sentences have 14 trigrams in common, and 19 in total.
The similarity is computed as:

 similarity = 14 / 19

You can check it with:

SELECT 
    cast(14.0/19.0 as real) AS computed_result, 
    similarity('Chateau blanc', 'chateau cheval blanc') AS function_in_pg

and you'll see that you get: 0.736842

... which explains how similarity is computed, and why you get the values you get.


NOTE: You can compute the intersection and union by means of:

SELECT 
   array_agg(t) AS in_common
FROM
(
    SELECT unnest(show_trgm('Chateau blanc')) AS t 
    INTERSECT 
    SELECT unnest(show_trgm('chateau chevla blanc')) AS t
    ORDER BY t
) AS trigrams_in_common ;

SELECT 
   array_agg(t) AS in_total
FROM
(
    SELECT unnest(show_trgm('Chateau blanc')) AS t 
    UNION 
    SELECT unnest(show_trgm('chateau chevla blanc')) AS t
) AS trigrams_in_total ;

And this is a way to explore the similarity of different pair of sentences:

WITH p AS
(
    SELECT 
      'This is just a sentence I''ve invented'::text AS f1,
      'This is just a sentence I''ve also invented'::text AS f2
),
t1 AS
(
    SELECT unnest(show_trgm(f1)) FROM p
),
t2 AS
(
    SELECT unnest(show_trgm(f2)) FROM p
),
x AS
(
    SELECT
        (SELECT count(*) FROM 
            (SELECT * FROM t1 INTERSECT SELECT * FROM t2) AS s0)::integer AS same,
        (SELECT count(*) FROM 
            (SELECT * FROM t1 UNION     SELECT * FROM t2) AS s0)::integer AS total,
        similarity(f1, f2) AS sim_2
FROM
    p 
)
SELECT
    same, total, same::real/total::real AS sim_1, sim_2
FROM
    x ;

You can check it at Rextester