How to speed up select distinct?
You probably don't want to hear this, but the best option to speed up SELECT DISTINCT is to avoid DISTINCT to begin with. In many cases (not all!) it can be avoided with better database design or better queries.
Sometimes, GROUP BY is faster because it takes a different code path.
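For illustration, here is what the rewrite would look like with the events table and the predicates used later in this answer (whether the planner actually picks a faster strategy depends on your data and Postgres version):

SELECT DISTINCT user_id
FROM   events
WHERE  project_id = 1
AND    "time" >= '2015-01-11 8:00:00'
AND    "time" <  '2015-02-10 8:00:00';

-- Same result set, but GROUP BY may take a different code path:
SELECT user_id
FROM   events
WHERE  project_id = 1
AND    "time" >= '2015-01-11 8:00:00'
AND    "time" <  '2015-02-10 8:00:00'
GROUP  BY user_id;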
In your particular case, it doesn't seem like you can get rid of DISTINCT. But you could support the query with a specialized index if you have many queries of that kind:
CREATE INDEX foo ON events (project_id, "time", user_id);
Adding user_id is only useful if you get index-only scans out of this. Follow the link for details. That would remove the expensive Bitmap Heap Scan from your query plan, which consumes 90% of the query time.
Your EXPLAIN output tells me that the query has to condense 2,491 distinct users out of half a million matching rows. This won't become super-fast, no matter what you do, but it can be substantially faster.
If the time intervals in your queries are always the same, a MATERIALIZED VIEW folding user_id per (project_id, <fixed time interval>) would go a long way. No chance there with varying time intervals, though. Maybe you could at least fold users per hour or some other minimum time unit, and that would buy enough performance to warrant the considerable overhead.
Nitpick:
Most probably, the predicates on "time" should really be:
AND "time" >= '2015-01-11 8:00:00'
AND "time" < '2015-02-10 8:00:00';
Aside:
Don't use time as an identifier. It's a reserved word in standard SQL and a basic type in Postgres.
Here's my test of Sam's case and Erwin's answer:
DROP TABLE IF EXISTS t1;

CREATE TABLE t1 (id int, user_id int, project_id int, date_time timestamp without time zone);

-- 10 million rows - table size = 498 MB
INSERT INTO t1
SELECT row_number() OVER ()
     , round(row_number() OVER () / 1000)
     , round(row_number() OVER () / 100000)
     , date
FROM   generate_series('2015-01-01'::date, '2016-12-01'::date, '6 seconds'::interval) date
LIMIT  10000000;
-- before indexing: 10,000,000 rows - output = 100 rows - time = 2900 ms
SELECT DISTINCT user_id
FROM t1
WHERE project_id = 1
AND date_time > '2015-01-01 8:00:00'
AND date_time < '2016-12-01 8:00:00' ;
CREATE INDEX foo ON t1 (project_id, date_time, user_id); -- build time = 51.2 s - index size = 387 MB
-- after indexing: 10,000,000 rows - output = 100 rows - time = 75 ms (~38 times faster)
SELECT DISTINCT user_id
FROM t1
WHERE project_id = 1
AND date_time > '2015-01-01 00:00:00'
AND date_time < '2016-12-01 00:00:00' ;
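To check that the speed-up really comes from the index-only scan Erwin mentions, you can look at the plan; this is just how I would verify it, and the exact output depends on your data and Postgres version:

VACUUM ANALYZE t1;

-- Look for "Index Only Scan using foo on t1" in the plan,
-- and for "Heap Fetches: 0" after the table has been vacuumed.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT user_id
FROM   t1
WHERE  project_id = 1
AND    date_time > '2015-01-01 00:00:00'
AND    date_time < '2016-12-01 00:00:00';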
Erwin said: "You probably don't want to hear this, but the best option to speed up SELECT DISTINCT is to avoid DISTINCT to begin with. In many cases (not all!) it can be avoided with better database design or better queries." I think he's right: we should avoid DISTINCT, GROUP BY and ORDER BY wherever we can.
I ran into a situation like Sam's, and I think Sam could partition the events table by month. That reduces the amount of data each query has to scan, but you need a PL/pgSQL function (instead of the plain query above) that finds the appropriate partitions for the given conditions and runs the query against them. A sketch of the idea follows below.
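One way to sketch the partitioning idea is with declarative partitioning (Postgres 10+), where the planner prunes partitions by itself, so the PL/pgSQL function may not even be needed; on older versions you need inheritance plus triggers or a function instead. Table and partition names below are made up for illustration:

CREATE TABLE events_partitioned (
  id         int
, user_id    int
, project_id int
, "time"     timestamp without time zone
) PARTITION BY RANGE ("time");

-- One partition per month:
CREATE TABLE events_2015_01 PARTITION OF events_partitioned
  FOR VALUES FROM ('2015-01-01') TO ('2015-02-01');
CREATE TABLE events_2015_02 PARTITION OF events_partitioned
  FOR VALUES FROM ('2015-02-01') TO ('2015-03-01');
-- ... more monthly partitions ...

-- Postgres 11+ cascades this index to every partition:
CREATE INDEX ON events_partitioned (project_id, "time", user_id);

-- The planner only touches partitions overlapping the time range:
SELECT DISTINCT user_id
FROM   events_partitioned
WHERE  project_id = 1
AND    "time" >= '2015-01-11 8:00:00'
AND    "time" <  '2015-02-10 8:00:00';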