Multicolumn index and performance
Answer
Since you refer to the website use-the-index-luke.com
, consider the chapter:
Use The Index, Luke › The Where Clause › Searching For Ranges › Greater, Less and BETWEEN
It has an example that matches your situation perfectly (two-column index, one is tested for equality, the other for range), explains (with more of those nice index graphics) why @ypercube's advice is accurate and sums it up:
Rule of thumb: index for equality first — then for ranges.
Also good for just one column?
What to do for queries on just one column seems to be clear. More details and benchmarks concerning that under these related question:
- Working of indexes in PostgreSQL
- Is a composite index also good for queries on the first field?
Less selective column first?
Apart from that, what if you have only equality conditions for both columns?
It doesn't matter. Put the column first that is more likely to receive conditions of its own, which actually matters.
Consider this demo, or reproduce it yourself. I create a simple table of two columns with 100k rows. One with very few, the other one with lots of distinct values:
CREATE TEMP TABLE t AS
SELECT (random() * 10000)::int AS lots
, (random() * 4)::int AS few
FROM generate_series (1, 100000);
DELETE FROM t WHERE random() > 0.9; -- create some dead tuples, more "real-life"
ANALYZE t;
SELECT count(distinct lots) -- 9999
, count(distinct few) -- 5
FROM t;
Query:
SELECT *
FROM t
WHERE lots = 2345
AND few = 2;
EXPLAIN ANALYZE
output (Best of 10 to exclude caching effects):
Seq Scan on t (cost=0.00..5840.84 rows=2 width=8) (actual time=5.646..15.535 rows=2 loops=1) Filter: ((lots = 2345) AND (few = 2)) Buffers: local hit=443 Total runtime: 15.557 ms
Add index, retest:
CREATE INDEX t_lf_idx ON t(lots, few);
Index Scan using t_lf_idx on t (cost=0.00..3.76 rows=2 width=8) (actual time=0.008..0.011 rows=2 loops=1) Index Cond: ((lots = 2345) AND (few = 2)) Buffers: local hit=4 Total runtime: 0.027 ms
Add other index, retest:
DROP INDEX t_lf_idx;
CREATE INDEX t_fl_idx ON t(few, lots);
Index Scan using t_fl_idx on t (cost=0.00..3.74 rows=2 width=8) (actual time=0.007..0.011 rows=2 loops=1) Index Cond: ((few = 2) AND (lots = 2345)) Buffers: local hit=4 Total runtime: 0.027 ms
If, as you say, the queries involving these 2 columns, are all equality checks of both columns, e.g.:
WHERE c1=@ParameterA AND c2=@ParameterB
do not bother with this. I doubt there will be any difference and if there is one, it will be negligible. You can always test of course, with your data and your server settings. Different versions of a DBMS can behave slightly differently regarding optimization.
The order inside the index would matter for other types of queries, having checks of one column only, or inequality conditions, or conditions on one column and grouping in the other, etc.
If I were to choose one of the two orders, I'd choose to put the less selective column first. Consider a table with columns year
and month
. It's more probable that you need a WHERE year = 2000
condition or a WHERE year BETWEEN 2000 AND 2013
or a WHERE (year, month) BETWEEN (1999, 6) AND (2000, 5)
.
A query of the type WHERE month = 7 GROUP BY year
may be wanted sure (Find people born on July), but would be less often. That depends of course on the actual data stored in your table. Choose one order for now, say the (c1, c2)
and you can always add another index later (c2, c1)
.
Update, after the OP's comment:
There are also conditions like this:
WHERE c1 = 'abc' AND c2 LIKE 'ab%'
This type of query if exactly a range condition on c2
column and would need a (c1, c2)
index. If you also have queries of the reverse type:
WHERE c2 = 'abc' AND c1 LIKE 'ab%'
then it would be good if you had a (c2, c1)
index as well.