Looking for the fastest solution for point-in-polygon analysis of 200 million points
ST_DWithin was faster in my test than ST_Intersects. That is surprising, especially since the prepared geometry algorithm is supposed to kick in on cases like this. I think there is a chance that this will be quite a lot faster than I showed here.
I did some more tests, and two things almost increased the speed tenfold. First, I tried on a newer computer, but still a quite ordinary laptop, except perhaps for the SATA3 SSD disks.
Then the query below took 18 seconds instead of 62 seconds on the old laptop. Next, I found that I was totally wrong before when I wrote that the index on the point table wasn't necessary. With that index in place (see the sketch further down), ST_Intersects behaved as expected and things became very fast. I increased the number of points in the point table to 1 million, and the query:
CREATE TABLE points_ct AS
SELECT imported_ct.gid AS ct_gid, t.gid AS point_id
FROM imported_ct, t
WHERE ST_Intersects(imported_ct.geom, t.geom);
runs in 72 seconds. Since there are 1249 polygons, 1,249,000,000 tests are done in 72 seconds. That makes about 17,000,000 tests per second, or almost 14,000 points tested against all polygons per second.
From this test, your 400,000,000 points should take about 8 hours without any need to distribute the load over several cores. PostGIS never ceases to impress me :-)
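For completeness, the point-table index mentioned above is an ordinary GIST spatial index. A minimal sketch, assuming the geometry column is named geom as in the query above (the index name is just for illustration):

CREATE INDEX t_geom_idx ON t USING GIST (geom);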
First, to visualize the result you can add the point geometry to the resulting table, open it in QGIS for instance, and style it with unique values on the ct_gid field.
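One way to carry the geometry along is simply to include it in the select list. A sketch, reusing the column names from the query above:

CREATE TABLE points_ct AS
SELECT imported_ct.gid AS ct_gid, t.gid AS point_id, t.geom
FROM imported_ct, t
WHERE ST_Intersects(imported_ct.geom, t.geom);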
Second, yes, you can also get the points falling outside any polygon by using a right (or left) join like this:
CREATE TABLE points_ct AS
SELECT imported_ct.gid AS ct_gid, t.gid AS point_id
FROM imported_ct
RIGHT JOIN t ON ST_Intersects(imported_ct.the_geom, t.geom);
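With the right join, points falling outside every polygon come out with a NULL ct_gid, so they can be pulled out afterwards. A sketch:

SELECT point_id FROM points_ct WHERE ct_gid IS NULL;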
I did some tests to verify whether this seems possible with PostGIS.
First, something I don't understand. You have two points per row. Are both points always in the same polygon? Then it is enough to do the calculations on one of the points. If they can be in two different polygons, you will need a way to connect one point row to two polygons; see the sketch below.
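One way to connect a point row to two polygons is to join against the polygon table twice. A minimal sketch, where the table name points and the columns geom1 and geom2 are assumptions for illustration, not names from your data:

SELECT p.gid,
       ct1.gid AS ct_gid_1,
       ct2.gid AS ct_gid_2
FROM points p
LEFT JOIN imported_ct ct1 ON ST_Intersects(ct1.the_geom, p.geom1)
LEFT JOIN imported_ct ct2 ON ST_Intersects(ct2.the_geom, p.geom2);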
From the tests it seems doable, but you might need some creative solution to spread the load over more than one cpu-core.
I tested on a 4-year-old laptop with a dual-core Centrino cpu (about 2.2 GHz, I think) and 2 GB RAM. If you have 48 GB RAM, I guess you have a lot more cpu-power too.
What I did was to create a random point table with 100000 points like this:
-- Random points spread over the extent of the polygon table.
-- Computing the random coordinates in the select list guarantees
-- fresh values for every generated row.
CREATE TABLE t AS
WITH r AS
  (SELECT ST_Extent(the_geom)::geometry AS ext FROM imported_ct)
SELECT ST_Point(
         ST_XMin(ext) + random() * (ST_XMax(ext) - ST_XMin(ext)),
         ST_YMin(ext) + random() * (ST_YMax(ext) - ST_YMin(ext))
       ) AS geom
FROM r, GENERATE_SERIES(1, 100000);
Then adding a gid like:
ALTER TABLE t ADD COLUMN gid SERIAL;
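A SERIAL column by itself does not create an index. The split queries further down filter on t.gid, so a plain btree index might help there; a sketch (the index name is just for illustration):

CREATE INDEX t_gid_idx ON t (gid);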
Then running:
CREATE TABLE points_ct AS
SELECT imported_ct.gid AS ct_gid, t.gid AS point_id
FROM imported_ct, t
WHERE ST_DWithin(imported_ct.the_geom, t.geom, 0);
takes about 62 seconds (compare to your ArcGIS result with the same number of points). The result is a table connecting the points in my table t with the gid in the census tract table.
With that speed you will do 200 million points in about 34 hours (200,000,000 / 100,000 × 62 s ≈ 124,000 s). So, if it is enough to check one of the points, my old laptop can do it with one core.
But if you need to check both points, it might be harder.
Then you can manually distribute the load over more than one core by starting multiple sessions against the db and running different queries.
In my example, splitting the 100000 points between the two cpu-cores, I ran:
CREATE TABLE t1 AS
SELECT imported_ct.gid AS ct_gid, t.gid AS point_id
FROM imported_ct, t
WHERE t.gid > 50000 AND ST_DWithin(imported_ct.the_geom, t.geom, 0);
on one db-session at the same time as running:
CREATE TABLE t2 AS
SELECT imported_ct.gid AS ct_gid, t.gid AS point_id
FROM imported_ct, t
WHERE t.gid <= 50000 AND ST_DWithin(imported_ct.the_geom, t.geom, 0);
on another db-session.
That took about 36 seconds, so a little more than half the time of the first example, probably because both sessions were writing to disk at the same time. But since both cores were working in parallel, it didn't take more than 36 seconds of my wall-clock time.
To union tables t1 and t2 I tried:
CREATE TABLE t3 AS
SELECT * FROM t1
UNION ALL
SELECT * FROM t2;
which took about half a second (UNION ALL is cheap since it skips the duplicate elimination that plain UNION does).
So, with fresher hardware and the load distributed over several cores, this should absolutely be possible, even if the real world will be slower than the test case.
It might be worth noting that the example is from Linux (Ubuntu); Windows would be another story. But I had all my other daily applications running, so the laptop was quite heavily loaded already. That might simulate the Windows case quite well, without opening anything but pgAdmin.
Probably the easiest way is with PostGIS. There are some tutorials on the internet about how to import csv/txt point data into PostGIS. Link1
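For the import step, a minimal sketch of one common approach, where the file path, column names and SRID are assumptions for illustration:

-- Load the raw coordinates, then build point geometries from them.
CREATE TABLE raw_points (x double precision, y double precision);
COPY raw_points FROM '/tmp/points.csv' WITH (FORMAT csv, HEADER true);

CREATE TABLE points AS
SELECT ST_SetSRID(ST_MakePoint(x, y), 4326) AS geom
FROM raw_points;
CREATE INDEX points_geom_idx ON points USING GIST (geom);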
I am not sure about the performance of point-in-polygon searches in PostGIS, but it should be faster than ArcGIS. The GIST spatial index that PostGIS uses is pretty fast. Link2 Link3
You could also test the MongoDB geospatial index, but that requires a little more time to get started. I believe MongoDB could be really fast, but I haven't tested it with point-in-polygon searches, so I cannot be sure.