Slow query in PostgreSQL selecting a single row from between a range defined in two columns
This is a slightly different solution to those already offered, which involved using spatial indexes to do some tricks.
Instead, it's worth remembering that with IP addresses you cannot have overlapping ranges. That is, A -> B cannot intersect X -> Y in any way. Knowing this, you can change your SELECT query slightly and take advantage of this trait. In doing so you do not need any "clever" indexing at all; in fact, you only need to index your ip_from column.
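For reference, here is a minimal sketch of the kind of table being discussed. The column names match the EXPLAIN output further down; the exact types are my assumptions:

-- Assumed minimal layout; column names are taken from the EXPLAIN
-- output below, but the types are guesses.
CREATE TABLE public.ip2location (
    ip_from      bigint NOT NULL,  -- first address of the range, as an integer
    ip_to        bigint NOT NULL,  -- last address of the range, as an integer
    country_code character(2),
    country_name text,
    region_name  text,
    city_name    text,
    latitude     real,
    longitude    real,
    zip_code     text,
    time_zone    text
);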
Previously, the query being analyzed was:
SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;
Let's assume that the range that 2538629520 falls into happens to be 2538629512 to 2538629537.
Note: it really does not matter what the range is; this is just to help demonstrate the pattern we can take advantage of.
From this we can assume that the next ip_from value is 2538629538. We don't actually need to worry about any records from this ip_from value upwards; all we actually care about is the range where ip_from equals 2538629512.
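To make that concrete, the relevant slice of the table might look like this (hypothetical rows; only the first range and the 2538629538 boundary come from the example above, the rest is invented):

-- Hypothetical neighbouring rows around our IP (2538629520):
--   ip_from      ip_to
--   2538629512   2538629537   <- 2538629520 falls in this range
--   2538629538   2538629545   <- next range starts right after ip_to
INSERT INTO public.ip2location (ip_from, ip_to, country_code, country_name)
VALUES (2538629512, 2538629537, 'XX', 'Example'),
       (2538629538, 2538629545, 'XX', 'Example');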
Knowing this fact, our query actually becomes (in English):

Find me the maximum ip_from value that my IP address is greater than or equal to, and show me the record where you find this value.

Or in other words: find me the ip_from value just at or before my IP address and give me that record.
Because we never have overlapping ranges of ip_from to ip_to, this holds true and allows us to write the query as:
SELECT *
FROM ip2location
WHERE ip_from = (
SELECT MAX(ip_from)
FROM ip2location
WHERE ip_from <= 2538629520
)
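As an aside, the same single-seek lookup is often written with ORDER BY ... LIMIT 1, which the planner can satisfy with the same index seek; and if your dataset can contain gaps between ranges, a final check on ip_to stops an IP that falls in a gap from matching the preceding range. Both variants below are my additions, not part of the answer's original query:

-- Equivalent formulation: one backwards index seek, then stop.
SELECT *
FROM ip2location
WHERE ip_from <= 2538629520
ORDER BY ip_from DESC
LIMIT 1;

-- If ranges can have gaps, re-check the upper bound so an IP between
-- two ranges returns no row instead of the wrong row.
SELECT *
FROM ip2location
WHERE ip_from <= 2538629520
  AND ip_to >= 2538629520
ORDER BY ip_from DESC
LIMIT 1;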
Back to the indexing to take advantage of all this. All we're actually searching on is ip_from, and we are doing integer comparisons. The MAX(ip_from) subquery has PostgreSQL find the single boundary record we need, which is good because it can seek straight to that record and then not worry about any other records at all.
All we really need is an index like:
CREATE UNIQUE INDEX CONCURRENTLY ix_ip2location_ipFrom ON public.ip2location (ip_from);
We can make the index unique because we will not have overlapping records. I would even make this column the primary key myself.
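If you do go the primary-key route, PostgreSQL can promote the existing unique index instead of building a new one (the constraint name here is just an example):

-- Reuses the unique index as the primary key; requires ip_from to be
-- NOT NULL. PostgreSQL renames the index to match the constraint name.
ALTER TABLE public.ip2location
    ADD CONSTRAINT ip2location_pkey PRIMARY KEY USING INDEX ix_ip2location_ipFrom;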
With this index and this query, the explain plan is:
Index Scan using ix_ip2location_ipfrom on public.ip2location  (cost=0.90..8.92 rows=1 width=69) (actual time=0.530..0.533 rows=1 loops=1)
  Output: ip2location.ip_from, ip2location.ip_to, ip2location.country_code, ip2location.country_name, ip2location.region_name, ip2location.city_name, ip2location.latitude, ip2location.longitude, ip2location.zip_code, ip2location.time_zone
  Index Cond: (ip2location.ip_from = $1)
  InitPlan 2 (returns $1)
    ->  Result  (cost=0.46..0.47 rows=1 width=8) (actual time=0.452..0.452 rows=1 loops=1)
          Output: $0
          InitPlan 1 (returns $0)
            ->  Limit  (cost=0.43..0.46 rows=1 width=8) (actual time=0.443..0.444 rows=1 loops=1)
                  Output: ip2location_1.ip_from
                  ->  Index Only Scan using ix_ip2location_ipfrom on public.ip2location ip2location_1  (cost=0.43..35440.79 rows=1144218 width=8) (actual time=0.438..0.438 rows=1 loops=1)
                        Output: ip2location_1.ip_from
                        Index Cond: ((ip2location_1.ip_from IS NOT NULL) AND (ip2location_1.ip_from >= '2538629520'::bigint))
                        Heap Fetches: 0
To give you an idea of the improvement in query performance with this approach, I tested it on my Raspberry Pi. The original approach took approx 4 seconds; this approach takes approx 120 ms. The big win comes from a single row seek versus scans. The original query suffers badly for low range values, as more of the table has to be considered for the result, whereas this query exhibits consistent performance across the whole range of values.
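If you want to reproduce the comparison on your own data, timing both forms with EXPLAIN ANALYZE is enough (table names as used earlier in this answer):

-- Original range predicate: work grows as the target IP gets lower.
EXPLAIN (ANALYZE, VERBOSE)
SELECT * FROM ip2location_db11 WHERE 2538629520 BETWEEN ip_from AND ip_to;

-- Single-seek rewrite: roughly constant work for any target IP.
EXPLAIN (ANALYZE, VERBOSE)
SELECT *
FROM ip2location
WHERE ip_from = (SELECT MAX(ip_from) FROM ip2location WHERE ip_from <= 2538629520);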
Hope this helps and my explanation makes sense to you all.