Filter on array text[] and sort on timestamp
General considerations
Index optimization always depends on the complete picture. Table size, row size, cardinalities, value frequencies, selectivity of typical queries, Postgres version, typical access patterns, etc.
Your case is particularly difficult for two reasons:
- Different columns used in
WHERE
andORDER BY
. Filter on array is most efficient with GIN or GiST index, but neither index type produces sorted output. The manual:
Of the index types currently supported by PostgreSQL, only B-tree can produce sorted output — the other index types return matching rows in an unspecified, implementation-dependent order.
You can create a multicolumn GIN index on (tags, maker_date)
or even more columns (the order of index columns is irrelevant for GIN indexes). But you need to have the additional module btree_gin
installed. Instructions:
- Inner join using an array column
And it's not going to help for the ORDER BY
component of your problem.
One more clarification: OFFSET m LIMIT n
is typically almost as expensive as LIMIT m+n
.
Solution for added specifications
You clarified: only 6 distinct tags possible. That's crucial.
Your table is big and your table definition leaves room for improvements. Size matters for big tables. Your numbers (30 GB, 10 million rows) also suggest a big avg. row size of ~ 3 KB. Either you have more columns than you show or table bloat and need a VACUUM FULL
run (or similar) or your value
column is big and TOASTed, which would make my improvements even more effective since the main relation is cut down to half its size or less with this:
CREATE TABLE tags_tmp (
id int PRIMARY KEY -- assuming PK
, tags int NOT NULL -- also assuming NOT NULL
, value text
, maker_date timestamp NOT NULL -- NOT NULL!
);
The order of columns is relevant because of alignment padding. Details:
- Configuring PostgreSQL for read performance
More importantly, this: tags int
. Why?
Arrays have a sizeable overhead of 24 byte (similar to a row), plus actual items.
- Calculating and saving space in PostgreSQL
So a text[]
with 1-6 items like you demonstrate ('funny', 'inspiration', ...) occupies ~ 56 bytes on avg.
And 6 distinct values can be represented by only 6 bits of information (assuming that sort order of the array is irrelevant).
We could compress even more, but I chose the convenient integer
type (occupies 4 bytes), which provides space for up to 31 distinct tags. That leaves room for later additions without changes to the table schema. Detailed rationale:
- Should I use the PostgreSQL bit string?
Your tags map to bits in a bitmap, 'a'
being the least significant bit (to the right):
tag: a | b | c | d | e | f
position: 0 | 1 | 2 | 3 | 4 | 5
int value: 1 | 2 | 4 | 8 | 16 | 32
So the tag array '{a,d,f}'
maps to 41
. You can use any arbitrary strings instead of 'a'-'f', doesn't matter.
To encapsulate the logic I suggest two auxiliary functions, easily expandable:
tags -> integer:
CREATE OR REPLACE FUNCTION f_tags2int(text[])
RETURNS int AS
$func$
SELECT bit_or(CASE x
WHEN 'a' THEN 1
WHEN 'b' THEN 2
WHEN 'c' THEN 4
WHEN 'd' THEN 8
WHEN 'e' THEN 16
WHEN 'f' THEN 32
-- more?
END)
FROM unnest ($1) x
$func$ LANGUAGE SQL IMMUTABLE;
integer -> tags:
CREATE OR REPLACE FUNCTION f_int2tags(int)
RETURNS text[] AS
$func$
SELECT array_remove(ARRAY [CASE WHEN $1 & 1 > 0 THEN 'a' END
, CASE WHEN $1 & 2 > 0 THEN 'b' END
, CASE WHEN $1 & 4 > 0 THEN 'c' END
, CASE WHEN $1 & 8 > 0 THEN 'd' END
, CASE WHEN $1 & 16 > 0 THEN 'e' END
, CASE WHEN $1 & 32 > 0 THEN 'f' END], NULL)
-- more?
$func$ LANGUAGE SQL IMMUTABLE;
Basics here:
- Can I convert a bunch of boolean columns to a single bitmap in PostgreSQL?
For convenience, you can add a view to display tags as text array like you had it:
CREATE VIEW tags_tmp_pretty AS
SELECT id, tags
, f_int2tags(tags) AS tags_pretty
, maker_date, value
FROM tags_tmp;
Now your basic query can be:
SELECT id, tags, maker_date, value
FROM tags_tmp
WHERE tags & f_tags2int('{a,b}') > 0 -- any of the tags matched
ORDER by maker_date DESC
LIMIT 5;
Using the binary AND operator &
. There are more operators to manipulate the column. get_bit()
and set_bit()
from the binary string operators are also convenient.
The above query should be faster already, for the smaller size and cheaper operators alone, but nothing revolutionary, yet. To make it fast we need indices, and the above cannot use an index, yet.
One partial index for every tag:
CREATE INDEX foo_tag_a ON tags_tmp(maker_date DESC) WHERE tags & 1 > 0;
CREATE INDEX foo_tag_b ON tags_tmp(maker_date DESC) WHERE tags & 2 > 0;
...
CREATE INDEX foo_tag_f ON tags_tmp(maker_date DESC) WHERE tags & 32 > 0;
This query is equivalent to the above, but can utilize the indexes:
SELECT *
FROM tags_tmp_pretty
WHERE (tags & f_tags2int('{a}') > 0 -- same as tags & 1
OR tags & f_tags2int('{e}') > 0) -- same as tags & 32
ORDER BY maker_date DESC
LIMIT 10;
Postgres can combine several bitmap index scans in a BitmapOr
step very efficiently - as demonstrated in this SQL Fiddle.
You could add another index condition to limit indexes to maker_date
> some constant timestamp (and repeat the verbatim condition in queries) to cut down their size (massively). Related example:
- Index optimization with dates
More sophisticated:
- Add datetime constraint to a PostgreSQL multi-column partial index
Other related answers:
How to speed up ORDER BY sorting when using GIN index in PostgreSQL?
Can spatial index help a "range - order by - limit" query
Or just 6 boolean
columns ...
Simply 6 boolean columns might be an even better choice. There are some pros & cons to either solution ...
CREATE TABLE tags_tmp (
id int PRIMARY KEY -- assuming PK
, tag_a bool
, tag_b bool
...
, tag_f bool
, value text
, maker_date timestamp NOT NULL -- NOT NULL!
);
You might define the flags NOT NULL
, depending on your complete use case.
Consider:
- Should I use the PostgreSQL bit string?
Partial indexes simply:
CREATE INDEX foo_tag_a ON tags_tmp(maker_date DESC) WHERE tag_a;
CREATE INDEX foo_tag_b ON tags_tmp(maker_date DESC) WHERE tag_b;
Etc.
Alternative for your special case
Thinking some more, since all of your few tags are so common, and combining multiple tags with OR is even less selective, it will be fastest to only have a btree index on maker_date DESC
. Postgres can traverse the index and filter qualifying rows on tags. This will work in combination with separate boolean columns instead of the array or encoded integer, because Postgres has more useful columns statistics for separate columns.
CREATE INDEX tags_tmp_date ON tags_tmp(maker_date DESC);
And then:
SELECT *
FROM tags_tmp_pretty
WHERE tag_a
OR tag_b
ORDER BY maker_date DESC
LIMIT 10;
Paging
You need an unambiguous sort order for result sets, to make paging work. I did not bother in this answer, it's too long already. Typically you would add more columns to ORDER BY
. How to make paging work efficiently with that:
- Optimize query with OFFSET on large table