How to calculate distance for every row in a pandas dataframe from a single point efficiently?
You can compute the Euclidean distance (L2 norm) in a vectorized way using the formula
sqrt((a1 - b1)^2 + (a2 - b2)^2 + ...)
df.sub(point, axis=1).pow(2).sum(axis=1).pow(.5)
0 0.474690
1 0.257080
2 0.703857
3 0.503596
4 0.461151
dtype: float64
This gives the same output as your current code.
Or, using np.linalg.norm:
np.linalg.norm(df.to_numpy() - point, axis=1)
# array([0.47468985, 0.25707985, 0.70385676, 0.5035961 , 0.46115096])
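To check that the two vectorized versions above agree with a plain row-by-row calculation, here is a minimal self-contained sketch. The sample DataFrame, the column names `x`, `y`, `z`, and the value of `point` are made up for illustration and are not the data from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: 5 rows, 3 numeric columns (not the asker's data).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 3)), columns=["x", "y", "z"])
point = np.array([0.5, 0.5, 0.5])  # one coordinate per column, in column order

# pandas version: subtract the point from every row, square, sum, take the root.
dist_pandas = df.sub(point, axis=1).pow(2).sum(axis=1).pow(0.5)

# NumPy version: L2 norm of each row of the difference matrix.
dist_numpy = np.linalg.norm(df.to_numpy() - point, axis=1)

# Reference: explicit loop over rows (slow, shown only for checking).
dist_loop = np.array([np.sqrt(((row - point) ** 2).sum()) for row in df.to_numpy()])

print(np.allclose(dist_pandas.to_numpy(), dist_numpy))
print(np.allclose(dist_numpy, dist_loop))
```

Note that `point` must have one entry per column, in the same order as the DataFrame's columns, for the broadcasting to line up.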
Another option is to use cdist, which is a bit faster:
from scipy.spatial.distance import cdist
cdist(point[None,], df.values)
Output:
array([[0.47468985, 0.25707985, 0.70385676, 0.5035961 , 0.46115096]])
A timing comparison with 100k rows:
%%timeit -n 10
cdist([point], df.values)
645 µs ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
np.linalg.norm(df.to_numpy() - point, axis=1)
5.16 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
df.sub(point, axis=1).pow(2).sum(axis=1).pow(.5)
16.8 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
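If you want to reproduce a comparison like this outside IPython, something along these lines should work. It is only a sketch: the 3-column random DataFrame is an assumption (the shape of the benchmark data is not stated above), and the absolute numbers will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Assumed benchmark data: 100k rows, 3 columns of random floats.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100_000, 3)))
point = rng.random(3)

candidates = {
    "cdist": lambda: cdist([point], df.to_numpy()),
    "linalg.norm": lambda: np.linalg.norm(df.to_numpy() - point, axis=1),
    "pandas": lambda: df.sub(point, axis=1).pow(2).sum(axis=1).pow(0.5),
}

timings = {}
for name, fn in candidates.items():
    # Best of 3 repeats of 10 calls each, converted to seconds per call.
    timings[name] = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{name:12s} {timings[name] * 1e3:.3f} ms per call")
```

Taking the minimum over repeats is a common way to reduce noise from other processes; `%%timeit` reports mean ± std instead, so the figures are not directly comparable.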
Let's use SciPy:
import numpy as np
from scipy.spatial import distance
ary = distance.cdist(df.values, np.array([point]), metric='euclidean')
ary
Out[57]:
array([[0.47468985],
       [0.25707985],
       [0.70385676],
       [0.5035961 ],
       [0.46115096]])
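Note that cdist returns a 2-D array, here with shape (n_rows, 1). If you want the distances as a new column, you can flatten it first. A small sketch with made-up data (the column names `x`, `y` and the new column name `dist` are just for illustration):

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance

# Hypothetical data chosen so the distances are easy to verify by hand.
df = pd.DataFrame({"x": [0.0, 3.0, 1.0], "y": [0.0, 4.0, 1.0]})
point = np.array([0.0, 0.0])

# cdist expects two 2-D arrays, so the single point becomes a 1-row matrix.
ary = distance.cdist(df.to_numpy(), point[None, :], metric="euclidean")
print(ary.shape)  # (3, 1): one row per DataFrame row, one column per point

# Flatten to 1-D before assigning as a column.
df["dist"] = ary.ravel()
print(df)
```

Compute `ary` before adding the new column; otherwise `df.to_numpy()` would include `dist` and no longer match the dimensionality of `point`.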