Full text search index on a multilingual column

I've just made a Postgres function to test the text language. It's not perfect but it works for long texts.

CREATE OR REPLACE FUNCTION get_language(t text) RETURNS regconfig AS $$
DECLARE
    ret regconfig;
    BEGIN
            WITH l as ( SELECT cfgname, to_tsvector(cfgname::regconfig, title) as vector, length(to_tsvector(cfgname::regconfig, title)) as len
        FROM pg_ts_config, (select t as title) as ti)
    SELECT cfgname::regconfig
    INTO ret 
    FROM l
    WHERE len=(SELECT MIN(len) FROM l)
    ORDER BY cfgname='simple' DESC, cfgname ASC
    LIMIT 1;
    RETURN ret;
    END;
$$ LANGUAGE plpgsql;

It just look for the shortest tsvector for the given text (so it tries every ts config of postgres).


Is it possible to create an index that works for several languages ?

Yes, but you need a second column that identifies the language of the text. Say you added a column doc_language to the table; you could then write:

CREATE INDEX title_idx ON shows USING gin(to_tsvector(doc_language, title));

Of course, this requires that you know the language of the subject text, something that can be hard to do in practice. If you don't need stemming, etc, you can just use the language simple, but I'm guessing you would've done that already if it were an option.

As an alternative, if you have a fixed and limited set of languages, you can concatenate the vectors for the different languages. E.g.:

regress=> SELECT to_tsvector('english', 'cafés') || to_tsvector('french', 'cafés') || to_tsvector('simple', 'cafés');
          ?column?          
----------------------------
 'caf':2 'café':1 'cafés':3
(1 row)

That'll match a tsquery for cafés in any of those three languages.

As an index:

CREATE INDEX title_idx ON shows USING gin((
    to_tsvector('english', title) || 
    to_tsvector('french', title) || 
    to_tsvector('simple', title)
));

but this is clumsy to use in queries, as the planner isn't very smart about matching index quals. So I'd wrap it in a function:

CREATE FUNCTION to_tsvector_multilang(text) RETURNS tsvector AS $$
SELECT to_tsvector('english', $1) || 
       to_tsvector('french', $1) || 
       to_tsvector('simple', $1)
$$ LANGUAGE sql IMMUTABLE;

CREATE INDEX title_idx ON shows USING gin(to_tsvector_multilang(title));

If you want you can even get fancy: pass the list of languages as an array (but remember it'll have to be exactly the same order for an index qual match to work). Use priorities with setweight, so you prefer a match in English to one in French, say. All sorts of options.

Tags:

Sql

Postgresql