Full text search over multiple related tables: indices and performance

Optimization

You're going down the right track.

You either need to

Denormalize
Cache

Caching the results

What you probably want is a MATERIALIZED VIEW. This is easy and works reasonably well.

CREATE MATERIALIZED VIEW foo
AS
SELECT t.id, to_tsvector(concat_ws(' ',a.name, o.address, t.description, a.name)) AS tsv
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id 
;

Then just

SELECT * FROM foo WHERE tsv @@ plainto_tsquery('foo bar');

Denormalizing the table

This can take a lot of forms, you've got this right though..

Redesign

Searching everything in a fuzzy fashion like this is a losing game. Even this knock off of Dungeon and Dragons meets Yahoo Answers has rules.

enter image description here

It becomes a lot easier to generate a query when you introduce syntax likes [text] for tagging, and is:answer to search just answers, rather than rebuilding Google and normalized indexes.

This looks like a fairly generic full-text search problem in a relational database.

Your prediction that updates in actors or objects would be troublesome in a denormalized structure looks spot-on. Better exhaust the possibilities with the normalized schema before thinking of denormalizing, especially since your tables are modest in size.

I'd suggest to FT-index all textual fields separately, and use a query engineered on the idea of querying all of them and combining results with the OR logical conjunction through an UNION.

Indexing (with the simple the text configuration for exact and language-agnostic matching, but use whatever is best in your case provided it's the same as in the query):

create index idx1 on objects using gin(to_tsvector('simple', name||' '||address));
create index idx2 on tasks using gin(to_tsvector('simple', description));
create index idx3 on actors using gin(to_tsvector('simple', name));

Searching for word1 or word2 anywhere in the indexed expressions :

WITH
 words(w) AS (VALUES ('word1'), ('word2')),
 matching_objects(id) as (select o.* from objects as o, words where to_tsquery('simple',w) @@ to_tsvector('simple', o.name||' '||o.address)),
 matching_tasks as (select t.* from tasks as t, words where to_tsquery('simple',w) @@ to_tsvector('simple', t.description)),
 matching_actors as (select a.* from actors as a, words where to_tsquery('simple',w) @@ to_tsvector('simple', a.name))
SELECT * FROM (
 SELECT t.id, t.description, a.name as actor_name, o.name as object_name
   FROM matching_tasks AS t JOIN actors AS a ON t.actor_id = a.id JOIN objects AS o ON t.object_id = o.id
UNION
  SELECT t.id, t.description, a.name as actor_name, o.name as object_name
    FROM tasks AS t JOIN matching_actors AS a ON t.actor_id = a.id JOIN objects AS o ON t.object_id = o.id
UNION
  SELECT t.id, t.description, a.name as actor_name, o.name as object_name
    FROM tasks AS t JOIN actors AS a ON t.actor_id = a.id JOIN matching_objects AS o ON t.object_id = o.id
) AS result;

Searching for word1 AND word2 in the same field would work by replacing

 words(w) AS (VALUES ('word1'), ('word2'))

with

 words(w) AS (VALUES ('word1 & word2'))

If word1 AND word2 must be present simultaneously in the same "task" (including joined tables) but not necessarily in the same field, it should be workable by adding a GROUP BY step on top of the above, filtering out the results that don't have exactly N hits when N words are searched for.

The query becomes:

WITH
 words(w) AS (VALUES ('word1'), ('word2')),
 matching_objects as (select w, o.* from objects as o, words where to_tsquery('simple',w) @@ to_tsvector('simple', o.name||' '||o.address)),
 matching_tasks as (select w,t .* from tasks as t, words where to_tsquery('simple',w) @@ to_tsvector('simple', t.description)),
 matching_actors as (select w, a.* from actors as a, words where to_tsquery('simple',w) @@ to_tsvector('simple', a.name))
SELECT id FROM (
 SELECT w, t.id
   FROM matching_tasks AS t JOIN actors AS a ON t.actor_id = a.id JOIN objects AS o ON t.object_id = o.id
UNION
  SELECT w, t.id
    FROM tasks AS t JOIN matching_actors AS a ON t.actor_id = a.id JOIN objects AS o ON t.object_id = o.id
UNION
  SELECT w, t.id
    FROM tasks AS t JOIN actors AS a ON t.actor_id = a.id JOIN matching_objects AS o ON t.object_id = o.id
) AS r GROUP BY id HAVING count(*)=(select count(*) FROM words);

The fact that UNION deduplicates the tuples takes care of filtering cases when the same word is found among different subqueries of the UNION construct.

This query produces only IDs of tasks. They'd need to be joined against actors and objects again to get back the columns that need to be displayed or returned.

Full text search over multiple related tables: indices and performance

Optimization

Caching the results

Denormalizing the table

Redesign

Tags:

Performance

Postgresql

Full Text Search

Postgresql Performance

Related

Recent Posts