Improve performance of COUNT/GROUP-BY in large PostgresSQL table?
Main problem is the missing index. But there is more.
SELECT user_id, count(*) AS ct
FROM treenode
WHERE project_id = 1
GROUP BY user_id;
You have many
bigint
columns. Probably overkill. Typically,integer
is more than enough for columns likeproject_id
anduser_id
. This would also help the next item.
While optimizing the table definition, consider this related answer, with an emphasis on data alignment and padding. But most of the rest applies, too:- Configuring PostgreSQL for read performance
The elephant in the room: there is no index on
project_id
. Create one. This is more important than the rest of this answer.
While being at it, make that a multicolumn index:CREATE INDEX treenode_project_id_user_id_index ON treenode (project_id, user_id);
If you followed my advice,
integer
would be perfect here:- Is a composite index also good for queries on the first field?
user_id
is definedNOT NULL
, socount(user_id)
is equivalent tocount(*)
, but the latter is a bit shorter and faster. (In this specific query, this would even apply withoutuser_id
being definedNOT NULL
.)id
is already the primary key, the additionalUNIQUE
constraint is useless ballast. Drop it:"treenode_pkey" PRIMARY KEY, btree (id)
"treenode_id_key" UNIQUE CONSTRAINT, btree (id)Aside: I'd not use
id
as column name. Use something descriptive liketreenode_id
.
Added information
Q: How many different project_id and user_id?
A: not more than five different project_id
.
That means Postgres has to read about 20% of the whole table to satisfy your query. Unless it can use an index-only scan, a sequential scan on the table will be faster than involving any indexes. No more performance to gain here - except by optimizing the table and server settings.
As for the index-only scan: To see how effective that can be, run VACUUM ANALYZE
if you can afford that (locks the table exclusively). Then try your query again. It should now be moderately faster using only the index. Read this related answer first:
- Optimize simple query using ORDER BY date and text
As well as the manual page added with Postgres 9.6 and the Postgres Wiki on index-only scans.
I'd first add an index on (project_id, user_id)
and then in 9.3 version, try this query:
SELECT u.user_id, c.number_of_nodes
FROM users AS u
, LATERAL
( SELECT COUNT(*) AS number_of_nodes
FROM treenode AS t
WHERE t.project_id = 1
AND t.user_id = u.user_id
) c
-- WHERE c.number_of_nodes > 0 ; -- you probably want this as well
-- to show only relevant users
In 9.2, try this one:
SELECT u.user_id,
( SELECT COUNT(*)
FROM treenode AS t
WHERE t.project_id = 1
AND t.user_id = u.user_id
) AS number_of_nodes
FROM users AS u ;
I assume you have a users
table. If not, replace users
with:
(SELECT DISTINCT user_id FROM treenode)