Full text search over multiple related tables: indices and performance
Optimization
You're going down the right track.
You either need to
- Denormalize
- Cache
Caching the results
What you probably want is a MATERIALIZED VIEW
. This is easy and works reasonably well.
CREATE MATERIALIZED VIEW foo
AS
SELECT t.id, to_tsvector(concat_ws(' ',a.name, o.address, t.description, a.name)) AS tsv
FROM tasks AS t
INNER JOIN actors AS a ON t.actor_id = a.id
INNER JOIN objects AS o ON t.object_id = o.id
;
Then just
SELECT * FROM foo WHERE tsv @@ plainto_tsquery('foo bar');
Denormalizing the table
This can take a lot of forms, you've got this right though..
Redesign
Searching everything in a fuzzy fashion like this is a losing game. Even this knock off of Dungeon and Dragons meets Yahoo Answers has rules.
It becomes a lot easier to generate a query when you introduce syntax likes [text]
for tagging, and is:answer
to search just answers, rather than rebuilding Google and normalized indexes.
This looks like a fairly generic full-text search problem in a relational database.
Your prediction that updates in actors or objects would be troublesome in a denormalized structure looks spot-on. Better exhaust the possibilities with the normalized schema before thinking of denormalizing, especially since your tables are modest in size.
I'd suggest to FT-index all textual fields separately, and use a query engineered on the idea of querying all of them and combining results with the OR logical conjunction through an UNION.
Indexing (with the simple
the text configuration for exact and language-agnostic matching, but use whatever is best in your case provided it's the same as in the query):
create index idx1 on objects using gin(to_tsvector('simple', name||' '||address));
create index idx2 on tasks using gin(to_tsvector('simple', description));
create index idx3 on actors using gin(to_tsvector('simple', name));
Searching for word1
or word2
anywhere in the indexed expressions :
WITH
words(w) AS (VALUES ('word1'), ('word2')),
matching_objects(id) as (select o.* from objects as o, words where to_tsquery('simple',w) @@ to_tsvector('simple', o.name||' '||o.address)),
matching_tasks as (select t.* from tasks as t, words where to_tsquery('simple',w) @@ to_tsvector('simple', t.description)),
matching_actors as (select a.* from actors as a, words where to_tsquery('simple',w) @@ to_tsvector('simple', a.name))
SELECT * FROM (
SELECT t.id, t.description, a.name as actor_name, o.name as object_name
FROM matching_tasks AS t JOIN actors AS a ON t.actor_id = a.id JOIN objects AS o ON t.object_id = o.id
UNION
SELECT t.id, t.description, a.name as actor_name, o.name as object_name
FROM tasks AS t JOIN matching_actors AS a ON t.actor_id = a.id JOIN objects AS o ON t.object_id = o.id
UNION
SELECT t.id, t.description, a.name as actor_name, o.name as object_name
FROM tasks AS t JOIN actors AS a ON t.actor_id = a.id JOIN matching_objects AS o ON t.object_id = o.id
) AS result;
Searching for word1
AND word2
in the same field would work by replacing
words(w) AS (VALUES ('word1'), ('word2'))
with
words(w) AS (VALUES ('word1 & word2'))
If word1
AND word2
must be present simultaneously in the same "task" (including joined tables) but not necessarily in the same field, it should be workable by adding a GROUP BY step on top of the above, filtering out the results that don't have exactly N hits when N words are searched for.
The query becomes:
WITH
words(w) AS (VALUES ('word1'), ('word2')),
matching_objects as (select w, o.* from objects as o, words where to_tsquery('simple',w) @@ to_tsvector('simple', o.name||' '||o.address)),
matching_tasks as (select w,t .* from tasks as t, words where to_tsquery('simple',w) @@ to_tsvector('simple', t.description)),
matching_actors as (select w, a.* from actors as a, words where to_tsquery('simple',w) @@ to_tsvector('simple', a.name))
SELECT id FROM (
SELECT w, t.id
FROM matching_tasks AS t JOIN actors AS a ON t.actor_id = a.id JOIN objects AS o ON t.object_id = o.id
UNION
SELECT w, t.id
FROM tasks AS t JOIN matching_actors AS a ON t.actor_id = a.id JOIN objects AS o ON t.object_id = o.id
UNION
SELECT w, t.id
FROM tasks AS t JOIN actors AS a ON t.actor_id = a.id JOIN matching_objects AS o ON t.object_id = o.id
) AS r GROUP BY id HAVING count(*)=(select count(*) FROM words);
The fact that UNION deduplicates the tuples takes care of filtering cases when the same word is found among different subqueries of the UNION construct.
This query produces only IDs of tasks. They'd need to be joined against actors
and objects
again to get back the columns that need to be displayed or returned.