Slow query on large table with GROUP BY and ORDER BY
The LEFT JOIN
in @dezso's answer should be good. An index, however, will hardly be useful (per se), because the query has to read the whole table anyway - the exception being index-only scans in Postgres 9.2+ and favorable conditions, see below.
SELECT m.hash, m.string, count(m.method) AS method_ct
FROM methods m
LEFT JOIN nostring n USING (hash)
WHERE n.hash IS NULL
GROUP BY m.hash, m.string
ORDER BY count(m.method) DESC;
Run EXPLAIN ANALYZE
on the query. Several times to exclude cashing effects and noise. Compare the best results.
Create a multi-column index that matches your query:
CREATE INDEX methods_cluster_idx ON methods (hash, string, method);
Wait? After I said an index wouldn't help? Well, we need it to CLUSTER
the table:
CLUSTER methods USING methods_cluster_idx;
ANALYZE methods;
Rerun EXPLAIN ANALYZE
. Any faster? It should be.
CLUSTER
is a one-time operation to rewrite the whole table in the order of the used index. It is also effectively a VACUUM FULL
. If you want to be sure, you'd run a pre-test with VACUUM FULL
alone to see what can be attributed to that.
If your table sees a lot of write operations, the effect will degrade over time. Schedule CLUSTER
at off-hours to restore the effect. Fine tuning depends of your exact use-case. The manual about CLUSTER
.
CLUSTER
is a rather crude tool, needs an exclusive lock on the table. If you can't afford that, consider pg_repack
which can do the same without exclusive lock. More in this later answer:
- Configuring PostgreSQL for read performance
If the percentage of NULL
values in the column method
is high (more than ~ 20 percent, depending on actual row sizes), a partial index should help:
CREATE INDEX methods_foo_idx ON methods (hash, string)
WHERE method IS NOT NULL;
(Your later update shows your columns to be NOT NULL
, so not applicable.)
If you are running PostgreSQL 9.2 or later (as @deszo commented) the presented indexes may be useful without CLUSTER
if the planner can utilize index-only scans. Only applicable under favorable conditions: No write operations that would effect the visibility map since the last VACUUM
and all columns in the query have to be covered by the index. Basically read-only tables can use this any time, while heavily written tables are limited. More details in the Postgres Wiki.
The above mentioned partial index could be even more useful in that case.
If, on the other hand, there are no NULL
values in column method
, you should
1.) define it NOT NULL
and
2.) use count(*)
instead of count(method)
, that's slightly faster and does the same in the absence of NULL
values.
If you have to call this query often and the table is read-only, create a MATERIALIZED VIEW
.
Exotic fine point:
Your table is named nostring
, yet seems to contain hashes. By excluding hashes instead of strings, there is a chance that you exclude more strings than intended. Extremely unlikely, but possible.
Welcome to DBA.SE!
You can try to rephrase your query like this:
SELECT m.hash, string, count(method)
FROM
methods m
LEFT JOIN nostring n ON m.hash = n.hash
WHERE n.hash IS NULL
GROUP BY hash, string
ORDER BY count(method) DESC;
or another possibility:
SELECT m.hash, string, count(method)
FROM
methods m
WHERE NOT EXISTS (SELECT hash FROM nostring WHERE hash = m.hash)
GROUP BY hash, string
ORDER BY count(method) DESC;
NOT IN
is a typical sink for performance since it is hard to use an index with it.
This may be further enhanced with indexes. An index on nostring.hash
looks useful. But first: what do you get now? (It would be better to see the output of EXPLAIN ANALYZE
since the costs themselves don't tell the time the operations took.)