Efficiently grouping a list of coordinates points by location in Python
Fundamentally, this is an image processing operation. If you use an image processing library like scikit-image (a.k.a. skimage
), it will be easy. Dealing with really huge data will eventually get slow, but 1024x1024 is nothing.
In [1]: import numpy as np
In [2]: import skimage.morphology
In [3]: x = [0,1,2,0,1,2,0,1,2,-3,-2,-1,-3,-2,-1,-3,-2,-1]
In [4]: y = [0,0,0,1,1,1,2,2,2,-3,-3,-3,-2,-2,-2,-1,-1,-1]
In [5]: dense = np.zeros((9,9), dtype=bool)
In [6]: dense[y,x] = True
In [7]: print(dense)
[[ True True True False False False False False False]
[ True True True False False False False False False]
[ True True True False False False False False False]
[False False False False False False False False False]
[False False False False False False False False False]
[False False False False False False False False False]
[False False False False False False True True True]
[False False False False False False True True True]
[False False False False False False True True True]]
In [8]: labeled = skimage.morphology.label(dense)
In [9]: print(labeled)
[[1 1 1 0 0 0 0 0 0]
[1 1 1 0 0 0 0 0 0]
[1 1 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 2 2 2]
[0 0 0 0 0 0 2 2 2]
[0 0 0 0 0 0 2 2 2]]
In [10]: coords_yx = { i: (labeled == i).nonzero() for i in range(1,labeled.max()+1) }
In [11]: coords_yx
Out[11]:
{1: (array([0, 0, 0, 1, 1, 1, 2, 2, 2]), array([0, 1, 2, 0, 1, 2, 0, 1, 2])),
2: (array([6, 6, 6, 7, 7, 7, 8, 8, 8]), array([6, 7, 8, 6, 7, 8, 6, 7, 8]))}
You can hash all coordinate points (e.g. using dictionary structure in python) and then for each coordinate point, hash the adjacent neighbors of the point to find pairs of points that are adjacent and "merge" them. Also, for each point you can maintain a pointer to the connected component that that point belongs to (using the dictionary structure), and for each connected component you maintain a list of points that belong to the component.
Then, when you hash a neighbor of a point and find a match, you merge the two connected component sets that the points belong to and update the group pointers for all new points in the union set. You can show that you only need to hash all the neighbors of all points just once and this will find all connected components, and furthermore, if you update the pointers for the smaller of the two connected components sets when two connected component sets are merged, then the run-time will be linear in the number of points.