Selecting top 10 from indexed field of a big table takes too long
In your sample code above, the index is explicitly created as NULLS LAST
and the query is implicitly running NULLS FIRST
(which is the default for ORDER BY .. DESC
) so PostgreSQL would need to re-sort the data if it used the index. As a result the index would actually make the query many times slower than even the (already-slow) table scan.
rds-9.6.5 root@db1=> create table performance (id integer, installs integer, hour timestamp without time zone);
CREATE TABLE
Time: 28.100 ms
rds-9.6.5 root@db1=> with generator as (select generate_series(1,166530) i)
[more] - > insert into performance (
[more] ( > select
[more] ( > i id,
[more] ( > (random()*1000)::integer installs,
[more] ( > (now() - make_interval(secs => i))::timestamp installs
[more] ( > from generator
[more] ( > );
INSERT 0 166530
Time: 244.872 ms
rds-9.6.5 root@db1=> create index hour_idx
[more] - > on performance
[more] - > using btree
[more] - > (hour desc nulls last);
CREATE INDEX
Time: 67.089 ms
rds-9.6.5 root@db1=> vacuum analyze performance;
VACUUM
Time: 43.552 ms
We can add a WHERE
clause on the hour column so that using the index becomes a good idea - but notice how we still need to re-sort the data from the index.
rds-9.6.5 root@db1=> explain select hour from performance where hour>now() order by hour desc limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------
Limit (cost=4.45..4.46 rows=1 width=8)
-> Sort (cost=4.45..4.46 rows=1 width=8)
Sort Key: hour DESC
-> Index Only Scan using hour_idx on performance (cost=0.42..4.44 rows=1 width=8)
Index Cond: (hour > now())
(5 rows)
Time: 0.789 ms
If we add an explicit NULLS LAST
to your query then it will use the index as expected.
rds-9.6.5 root@db1=> explain select hour from performance order by hour desc NULLS LAST limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Limit (cost=0.42..0.68 rows=10 width=8)
-> Index Only Scan using hour_idx on performance (cost=0.42..4334.37 rows=166530 width=8)
(2 rows)
Time: 0.526 ms
Alternatively, if we drop the (non-default) NULLS LAST
from your index then the query will use it as expected without modification.
rds-9.6.5 root@db1=> drop index hour_idx;
DROP INDEX
Time: 4.124 ms
rds-9.6.5 root@db1=> create index hour_idx
[more] - > on performance
[more] - > using btree
[more] - > (hour desc);
CREATE INDEX
Time: 69.220 ms
rds-9.6.5 root@db1=> explain select hour from performance order by hour desc limit 10;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Limit (cost=0.42..0.68 rows=10 width=8)
-> Index Only Scan using hour_idx on performance (cost=0.42..4334.37 rows=166530 width=8)
(2 rows)
Time: 0.725 ms
Note that you can also drop the DESC
from your index; PostgreSQL can scan indexes both forwards and backwards and on single-column indexes it's generally unnecessary to reverse them. You only need to be careful about having the right combination of order and nulls first/last.
rds-9.6.5 root@db1=> drop index hour_idx;
DROP INDEX
Time: 3.837 ms
rds-9.6.5 root@db1=> create index hour_idx
[more] - > on performance
[more] - > using btree
[more] - > (hour);
CREATE INDEX
Time: 94.815 ms
rds-9.6.5 root@db1=> explain select hour from performance order by hour desc limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.68 rows=10 width=8)
-> Index Only Scan Backward using hour_idx on performance (cost=0.42..4334.37 rows=166530 width=8)
(2 rows)
Time: 0.740 ms
If most of your queries intend to select non-NULL values from hour
then you should consider building a partial index on those values, i.e. something like:
CREATE INDEX hour_not_null_idx ON performance (hour)
WHERE hour IS NOT NULL;
which, as long as you either query for a particular value of hour
, as Jeremy demonstrated in his answer, or add hour IS NOT NULL
to your WHERE
clause, will give you the same results, and possibly save you a little space as well:
# explain select hour from performance where hour > now() order by hour desc limit 10;
Limit (cost=0.42..5.30 rows=10 width=8)
-> Index Only Scan Backward using hour_not_null_idx on performance (cost=0.42..8.72 rows=17 width=8)
Index Cond: (hour > now())
If there are no NULL
values in the column, you should declare it NOT NULL
(I'm going to assume you know how to do this with ALTER TABLE ;o)), then create the index (without NULLS LAST
, since it's no longer important anyway). Then you get the same benefit:
william=# create index hour_idx on performance using btree ( hour );
CREATE INDEX
william=# explain select hour from performance order by hour desc limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------
Limit (cost=0.42..0.73 rows=10 width=8)
-> Index Only Scan Backward using hour_idx on performance (cost=0.42..5238.37 rows=166530 width=8)
(2 rows)