How to improve performance of GeoIP query in BigQuery?
2019, much improved answer:
- https://medium.com/@hoffa/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds-e9e652480bd2
#standardSQL
# replace with your source of IP addresses
# here I'm using the same Wikipedia set from the previous article
WITH source_of_ip_addresses AS (
SELECT REGEXP_REPLACE(contributor_ip, 'xxx', '0') ip, COUNT(*) c
FROM `publicdata.samples.wikipedia`
WHERE contributor_ip IS NOT null
GROUP BY 1
)
SELECT country_name, SUM(c) c
FROM (
SELECT ip, country_name, c
FROM (
SELECT *, NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) network_bin
FROM source_of_ip_addresses, UNNEST(GENERATE_ARRAY(9,32)) mask
WHERE BYTE_LENGTH(NET.SAFE_IP_FROM_STRING(ip)) = 4
)
JOIN `fh-bigquery.geocode.201806_geolite2_city_ipv4_locs`
USING (network_bin, mask)
)
GROUP BY 1
ORDER BY 2 DESC
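To see why the query above is fast, it may help to model the join outside BigQuery. The idea, as far as the query shows, is that each address is expanded into one candidate network per prefix length (9 to 32), and those candidates are matched by plain equality on `(network_bin, mask)` instead of by a range comparison. The following Python sketch illustrates that idea under some assumptions: networks are stored as integers rather than BYTES, the networks in the lookup table do not overlap, and the sample rows and the helper names (`candidate_networks`, `build_index`, `lookup_by_prefix`) are made up for illustration.

```python
import ipaddress

def candidate_networks(ip, masks=range(9, 33)):
    """Yield (network_bin, mask) pairs for an IPv4 address, one per prefix length."""
    addr = int(ipaddress.IPv4Address(ip))
    for mask in masks:
        # Keep the top `mask` bits and zero the rest, mirroring
        # NET.SAFE_IP_FROM_STRING(ip) & NET.IP_NET_MASK(4, mask) in the query.
        netmask = (0xFFFFFFFF << (32 - mask)) & 0xFFFFFFFF
        yield (addr & netmask, mask)

def build_index(cidr_rows):
    """Map (network_bin, mask) -> country for a list of (cidr, country) rows."""
    index = {}
    for cidr, country in cidr_rows:
        net = ipaddress.IPv4Network(cidr)
        index[(int(net.network_address), net.prefixlen)] = country
    return index

def lookup_by_prefix(ip, index):
    """Return the country of the first candidate network found in the index, else None."""
    for key in candidate_networks(ip):
        if key in index:
            return index[key]
    return None

# Hypothetical sample rows, made up for illustration only.
GEO = build_index([("203.0.113.0/24", "Exampleland"), ("198.51.0.0/16", "Testonia")])
```

Each address costs at most 24 hash lookups here, which is the in-memory analogue of the equi-join that `USING (network_bin, mask)` performs.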
Cleaned up version of this answer at: http://googlecloudplatform.blogspot.com/2014/03/geoip-geolocation-with-google-bigquery.html
Let me tidy the original query:
SELECT
id,
client_ip,
client_ip_code,
B.Country_Name AS Country_Name
FROM (
SELECT
id,
contributor_ip AS client_ip,
INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
1 AS One
FROM
[publicdata:samples.wikipedia]
WHERE contributor_ip IS NOT NULL
LIMIT
1000
) AS A1
LEFT JOIN
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name,
1 AS One
FROM
--3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the END of the set
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
FROM
[playscape-proj:GeoIP.GeoIP]) -- all ranges of valid IPs
,
(
SELECT
PriorRangeEndIP+1 From_IP_Code,
From_IP_Code-1 AS To_IP_Code,
'NA' AS Country_Name -- missing ranges below From_IP
from(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
,
LAG(To_IP_Code,
1,
INTEGER(0)) OVER(
ORDER BY
From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
FROM
[playscape-proj:GeoIP.GeoIP])A
WHERE
From_IP_Code>PriorRangeEndIP+1) -- if the distance from the prior valid range is >1, there is a gap to fill
,
(
SELECT
MAX(To_IP_Code)+1 AS From_IP_Code,
INTEGER (4311810304) AS To_IP_Code,
'NA' AS Country_Name -- missing ranges above the max To_IP
FROM
[playscape-proj:GeoIP.GeoIP])
) AS B
ON A1.ONE=B.ONE -- fake JOIN condition, needed because JOINs only allow equality (=) conditions
WHERE
A1.client_ip_code>=B.From_IP_Code
AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
OR (A1.client_ip_code IS NULL
AND B.From_IP_Code=1 ) -- WHERE there IS no valid IP ON left contributor_ip;
That's a long query! (and a very interesting one). It runs in 14 seconds. How can we optimize it?
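Before optimizing, it may help to spell out what the three subqueries on the right side compute: the valid ranges, an 'NA' range for every hole between consecutive valid ranges (found with `LAG`), and one final 'NA' range after the last valid range. A minimal Python sketch of that logic follows, assuming non-overlapping input ranges; the function name `fill_gaps` is hypothetical, and the default upper bound is simply the constant used in the query.

```python
def fill_gaps(ranges, max_ip=4311810304):
    """Return the ranges plus 'NA' ranges covering every hole between them.

    ranges: list of (from_code, to_code, country) tuples, assumed non-overlapping.
    """
    out = []
    prior_end = 0  # plays the role of LAG(To_IP_Code, 1, 0)
    for start, end, country in sorted(ranges):
        if start > prior_end + 1:
            # Hole between the previous range and this one.
            out.append((prior_end + 1, start - 1, "NA"))
        out.append((start, end, country))
        prior_end = end
    # Final 'NA' range from MAX(To_IP_Code)+1 up to the upper bound.
    out.append((prior_end + 1, max_ip, "NA"))
    return out
```

For example, `fill_gaps([(10, 20, "A"), (30, 40, "B")], max_ip=50)` yields the two input ranges plus 'NA' ranges for 1 to 9, 21 to 29, and 41 to 50.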
Some tricks I found:
- Skip NULLs. If there is no ip address in a log, don't try to match it.
- Reduce the combinations. Instead of JOINing every left-side record with every right-side record, how about joining only the 39.x.x.x records on the left side with the 39.x.x.x records on the right side? Only a few (3 or 4) rules cover multiple first-octet ranges, so it would be easy to add a couple of rules to the geolite table to cover these gaps.
So I'm changing:
1 AS One
to
INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
(twice: on the right side the expression is INTEGER(From_IP_Code/(256*256*256)) AS One).
- Adding a `WHERE contributor_ip IS NOT NULL`.
And now it runs in 3 seconds! 5% of the IPs could not be geolocated, probably because of the gaps described above (an easy fix).
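The effect of replacing `1 AS One` with the first octet can be sketched in Python. This is only an illustration of the bucketing idea, not the BigQuery execution plan: it assumes `INTEGER(x/(256*256*256))` truncates to the first octet, and the sample ranges, country names, and helper names (`ip_code`, `bucket`, `index_by_bucket`, `lookup_in_bucket`) are invented for the example.

```python
import ipaddress
from collections import defaultdict

def ip_code(ip):
    """Integer form of a dotted IPv4 string, like INTEGER(PARSE_IP(ip))."""
    return int(ipaddress.IPv4Address(ip))

def bucket(code):
    """First octet of an IP code: code / (256*256*256), truncated."""
    return code // (256 * 256 * 256)

def index_by_bucket(ranges):
    """Group (from_code, to_code, country) rows by the bucket of from_code."""
    buckets = defaultdict(list)
    for row in ranges:
        buckets[bucket(row[0])].append(row)
    return buckets

def lookup_in_bucket(ip, buckets):
    """Compare the IP only against ranges that share its first octet."""
    code = ip_code(ip)
    for start, end, country in buckets.get(bucket(code), []):
        if start <= code <= end:
            return country
    return None

# Hypothetical sample ranges, made up for illustration only.
RANGES = [
    (ip_code("39.0.0.0"), ip_code("39.255.255.255"), "Exampleland"),
    (ip_code("40.0.0.0"), ip_code("41.255.255.255"), "Testonia"),  # spans two buckets
]
BUCKETS = index_by_bucket(RANGES)
```

Here `lookup_in_bucket("40.9.9.9", BUCKETS)` finds its range, but `lookup_in_bucket("41.0.0.1", BUCKETS)` returns `None`: the range covering it is filed only under bucket 40. That is the kind of miss that would explain the ungeolocated addresses.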
Now, how about going from LIMIT 1000 to LIMIT 300000? How long will it take?
37 seconds! Much better than the described 25 minutes. If you want to go even higher, I would suggest materializing the right-side table as a static one: once computed it doesn't change at all, since it's just an expansion of the basic rules. Then you can use JOIN EACH.
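One way to read "an expansion of the basic rules" is to split every range that crosses a first-octet boundary into one row per octet, so that the equality join on `One` no longer misses any address. The sketch below shows that precomputation in Python; it is an assumption about how such a static table could be built, not the method used for the numbers above, and `split_by_bucket` is a hypothetical name.

```python
BUCKET_SIZE = 256 * 256 * 256

def split_by_bucket(ranges):
    """Split each (from_code, to_code, country) so no row crosses a first-octet boundary.

    Returns rows of (bucket, from_code, to_code, country).
    """
    rows = []
    for start, end, country in ranges:
        while start <= end:
            b = start // BUCKET_SIZE
            bucket_end = (b + 1) * BUCKET_SIZE - 1  # last code in this bucket
            piece_end = min(end, bucket_end)
            rows.append((b, start, piece_end, country))
            start = piece_end + 1
    return rows
```

A range from 40.0.0.0 to 41.255.255.255 becomes two rows, one for bucket 40 and one for bucket 41, while a range that stays inside one octet passes through unchanged. The result could be stored once and joined as a static table.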
SELECT
id,
client_ip,
client_ip_code,
B.Country_Name AS Country_Name
FROM (
SELECT
id,
contributor_ip AS client_ip,
INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
FROM
[publicdata:samples.wikipedia]
WHERE contributor_ip IS NOT NULL
LIMIT
300000
) AS A1
JOIN
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name,
INTEGER(From_IP_Code/(256*256*256)) AS One
FROM
--3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the END of the set
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
FROM
[playscape-proj:GeoIP.GeoIP]) -- all ranges of valid IPs
,
(
SELECT
PriorRangeEndIP+1 From_IP_Code,
From_IP_Code-1 AS To_IP_Code,
'NA' AS Country_Name -- missing ranges below From_IP
from(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
,
LAG(To_IP_Code,
1,
INTEGER(0)) OVER(
ORDER BY
From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
FROM
[playscape-proj:GeoIP.GeoIP])A
WHERE
From_IP_Code>PriorRangeEndIP+1) -- if the distance from the prior valid range is >1, there is a gap to fill
,
(
SELECT
MAX(To_IP_Code)+1 AS From_IP_Code,
INTEGER (4311810304) AS To_IP_Code,
'NA' AS Country_Name -- missing ranges above the max To_IP
FROM
[playscape-proj:GeoIP.GeoIP])
) AS B
ON A1.ONE=B.ONE -- equality JOIN condition: only pair rows that share the same first octet
WHERE
A1.client_ip_code>=B.From_IP_Code
AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
OR (A1.client_ip_code IS NULL
AND B.From_IP_Code=1 ) -- WHERE there IS no valid IP ON left contributor_ip;
As a cool addendum (see previous answer to get the details): What are the top countries contributing edits to Wikipedia?
Row Country_Name c
1 United States 36605405
2 United Kingdom 10355936
3 Canada 4988835
4 Australia 3387582
5 India 1447756
6 Germany 1414713
7 Philippines 765874
8 Netherlands 668850
9 Ireland 651370
10 France 602113
11 New Zealand 590554
12 Sweden 556544
....
Query complete (28.5s elapsed, 1.07 GB processed)
Query:
SELECT Country_Name, COUNT(*) c
FROM (
SELECT
id,
client_ip,
client_ip_code,
B.Country_Name AS Country_Name
FROM (
SELECT
id,
contributor_ip AS client_ip,
INTEGER(PARSE_IP(contributor_ip)) AS client_ip_code,
INTEGER(PARSE_IP(contributor_ip)/(256*256*256)) AS One
FROM
[publicdata:samples.wikipedia]
WHERE contributor_ip IS NOT NULL
-- NO LIMITS - use ALL the data!
) AS A1
JOIN
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name,
INTEGER(From_IP_Code/(256*256*256)) AS One
FROM
--3 IP sets: 1.valid ranges, 2.Gaps, 3. Gap at the END of the set
(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name
FROM
[playscape-proj:GeoIP.GeoIP]) -- all ranges of valid IPs
,
(
SELECT
PriorRangeEndIP+1 From_IP_Code,
From_IP_Code-1 AS To_IP_Code,
'NA' AS Country_Name -- missing ranges below From_IP
from(
SELECT
From_IP_Code,
To_IP_Code,
Country_Name,
LAG(To_IP_Code,
1,
INTEGER(0)) OVER(
ORDER BY
From_IP_Code ASC) PriorRangeEndIP --use of LAG function to find prior valid range
FROM
[playscape-proj:GeoIP.GeoIP])A
WHERE
From_IP_Code>PriorRangeEndIP+1) -- if the distance from the prior valid range is >1, there is a gap to fill
,
(
SELECT
MAX(To_IP_Code)+1 AS From_IP_Code,
INTEGER (4311810304) AS To_IP_Code,
'NA' AS Country_Name -- missing ranges above the max To_IP
FROM
[playscape-proj:GeoIP.GeoIP])
) AS B
ON A1.ONE=B.ONE -- equality JOIN condition: only pair rows that share the same first octet
WHERE
A1.client_ip_code>=B.From_IP_Code
AND A1.client_ip_code<=B.To_IP_Code -- JOIN condition WHERE valid IP exists ON left
OR (A1.client_ip_code IS NULL
AND B.From_IP_Code=1 ) -- WHERE there IS no valid IP ON left contributor_ip;
)
GROUP BY 1 ORDER BY 2 DESC