Accelerating the code for random selection of polygons
You can use the spatial index that geopandas exposes through the sindex attribute. I tested it on three datasets containing 100, 1000 and 10000 points (instead of polygons), each with several different numbers of tiles.
# without spatial index (for loop in the question)
outputs = []
for tile in tiles:
    poly = Polygon(tile)
    ok = gdf[gdf.geometry.intersects(poly)]
    if ok.shape[0] >= 1:
        out = ok.sample(1)
        outputs.append(out)
# with spatial index
sindex = gdf.sindex
outputs = []
for tile in tiles:
    poly = Polygon(tile)
    # cheap first pass: rows whose bounding box overlaps the tile's bounds
    candidates_index = list(sindex.intersection(poly.bounds))
    candidates = gdf.iloc[candidates_index]
    # exact intersection test, run only on the candidates
    matches = candidates[candidates.intersects(poly)]
    if matches.shape[0] >= 1:
        out = matches.sample(1)
        outputs.append(out)
RESULTS (for-loop times in seconds):
Number of   Number of   No index   Index
tiles       points      (sec)      (sec)
--------------------------------------------
40          100         0.10       0.10
40          1000        0.50       0.12
40          10000       3.50       0.23
--------------------------------------------
560         100         1.4        1.6
560         1000        5.6        1.6
560         10000       50         1.6
--------------------------------------------
1420        100         3.5        4.5
1420        1000        15         4.5
1420        10000       132        4.0
--------------------------------------------
3096        100         8          10
3096        1000        34         10
3096        10000       392        10
As you can see, without the index the run time grows steeply as the number of points increases, while with the index it stays almost constant. With the index, it is the number of tiles that determines the run time.
EDIT: If the tiles list causes memory problems, you can use a generator instead.
# Just change the outer [] into (). tiles is then a generator, not a list.
# i.e. convert tiles = [ ... ] to tiles = ( ... )
tiles = (
    [(ulx, uly), (ulx, lry), (lrx, lry), (lrx, uly)]
    for ulx, uly, lrx, lry in zip(ulx_s, uly_s, lrx_s, lry_s)
)
# Remove the following line from the original code, because a generator does not support len():
# print(len(tiles))
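A small sketch of the two things to watch for after switching to a generator: len() no longer works, and the generator can only be looped over once. The corner coordinates and the make_tiles helper are made up for illustration; they are not from the question.

```python
# Hypothetical corner coordinates for three tiles (illustrative values only).
ulx_s, uly_s = [0, 1, 2], [1, 1, 1]
lrx_s, lry_s = [1, 2, 3], [0, 0, 0]

def make_tiles():
    """Return a fresh generator of tile corner lists (hypothetical helper)."""
    return (
        [(ulx, uly), (ulx, lry), (lrx, lry), (lrx, uly)]
        for ulx, uly, lrx, lry in zip(ulx_s, uly_s, lrx_s, lry_s)
    )

tiles = make_tiles()

# A generator has no length, so len(tiles) raises TypeError.
try:
    len(tiles)
    has_len = True
except TypeError:
    has_len = False

# The number of tiles is still available from the input lists.
n_tiles = len(ulx_s)

first_pass = list(tiles)   # consumes the generator
second_pass = list(tiles)  # already exhausted, so this is empty
```

If you need to loop over the tiles more than once (for example, to time both versions of the loop), call make_tiles() again to get a fresh generator each time.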
If there are (far) more polygons than grid cells, you should invert your computation, making the outer loop over the polygons. Something like:
for poly in polygons:
    bb = boundingBox(poly)
    compute list of grid cells intersecting/containing the bb
    # Note: this is NOT a polygon intersection, it's a simple comparison of bounds
    for each overlapping grid cell, add poly to that cell's list of overlapping boxes
for each cell in grid_cells:
    sample one overlapping box from the cell's list
    test to see if the polygon actually intersects the grid cell
    if false, delete the box from the list and sample again
    else add poly to your output
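The pseudocode above could be sketched in plain Python roughly as follows. This is only an illustration under some assumptions: the grid is regular with square cells of size cell_size aligned to the origin, polygons are lists of (x, y) vertices, and the exact intersection test is supplied by the caller. The names bounds, sample_one_per_cell and bbox_overlap are invented here, and bbox_overlap is a bounding-box stand-in for a real polygon test.

```python
import math
import random
from collections import defaultdict

def bounds(poly):
    """Bounding box (minx, miny, maxx, maxy) of a list of (x, y) vertices."""
    xs, ys = zip(*poly)
    return min(xs), min(ys), max(xs), max(ys)

def sample_one_per_cell(polygons, cell_size, intersects, rng=random):
    """Pick one intersecting polygon index per grid cell, where one exists.

    intersects(poly, cell_bounds) is a caller-supplied exact test.
    """
    # Outer loop over polygons: bucket each one by the cells its bbox touches.
    # This is a comparison of bounds only, not a polygon intersection.
    buckets = defaultdict(list)
    for idx, poly in enumerate(polygons):
        minx, miny, maxx, maxy = bounds(poly)
        for i in range(math.floor(minx / cell_size), math.floor(maxx / cell_size) + 1):
            for j in range(math.floor(miny / cell_size), math.floor(maxy / cell_size) + 1):
                buckets[(i, j)].append(idx)

    # Per cell: sample a candidate; drop it and resample if the exact test fails.
    chosen = {}
    for (i, j), bucket in buckets.items():
        cell = (i * cell_size, j * cell_size, (i + 1) * cell_size, (j + 1) * cell_size)
        candidates = list(bucket)
        while candidates:
            pick = candidates.pop(rng.randrange(len(candidates)))
            if intersects(polygons[pick], cell):
                chosen[(i, j)] = pick
                break
    return chosen

def bbox_overlap(poly, cell):
    """Stand-in for an exact polygon/cell test (assumption: bbox overlap only)."""
    minx, miny, maxx, maxy = bounds(poly)
    return minx <= cell[2] and maxx >= cell[0] and miny <= cell[3] and maxy >= cell[1]

polygons = [
    [(0.2, 0.2), (0.8, 0.2), (0.8, 0.8)],  # inside cell (0, 0)
    [(0.5, 0.5), (1.5, 0.5), (1.5, 0.6)],  # spans cells (0, 0) and (1, 0)
    [(2.1, 2.1), (2.9, 2.1), (2.9, 2.9)],  # inside cell (2, 2)
]
result = sample_one_per_cell(polygons, 1.0, bbox_overlap, rng=random.Random(0))
```

Only cells that at least one bounding box touches are ever visited, so empty grid cells cost nothing. With real geometries you would replace bbox_overlap with an exact test, for instance shapely's intersects against a box built from the cell bounds.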
I also note that you say you want 1 km grid cells, but you're working in lat/lon coordinates and using a conversion of 0.008983157 degrees = 1 km. That's correct for longitude at the equator, but it gets increasingly inaccurate as you move away from the equator. You really should work in a projected coordinate system, such as UTM, where the coordinates are in distance units.
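To get a feel for the size of the error, here is a rough calculation. It assumes a spherical Earth, on which the east-west length of a degree of longitude shrinks with the cosine of the latitude, and uses about 111.32 km per degree at the equator (the value implied by the conversion above). The function name east_west_km is made up for this sketch.

```python
import math

STEP_DEG = 0.008983157       # grid step in degrees, taken from the question
KM_PER_DEG_EQUATOR = 111.32  # approximate length of 1 degree of longitude at the equator

def east_west_km(lat_deg, step_deg=STEP_DEG):
    """Approximate east-west width in km of step_deg of longitude at lat_deg.

    Spherical approximation: the width scales with cos(latitude).
    """
    return step_deg * KM_PER_DEG_EQUATOR * math.cos(math.radians(lat_deg))

# Width of one "1 km" cell at a few latitudes.
widths = {lat: round(east_west_km(lat), 3) for lat in (0, 30, 45, 60)}
```

Under this approximation a cell that is about 1 km wide at the equator is only about 0.5 km wide at 60 degrees latitude, while its north-south extent stays close to 1 km, so the cells are no longer square.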