Vectorizing Haversine distance calculation in Python
You would provide your function as an argument to np.vectorize()
, and could then use it as an argument to pandas.groupby.apply
as illustrated below:
haver_vec = np.vectorize(haversine, otypes=[np.int16])
distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))
For instance, with sample data as follows:
length = 500
df = pd.DataFrame({'id':np.arange(length), 'coordinates':tuple(zip(np.random.uniform(-90, 90, length), np.random.uniform(-180, 180, length)))})
compare for 500 points:
def haver_vect(data):
distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates)))
return distance
%timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop
%timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop
From haversine's function definition
, it looked pretty parallelizable. So, using one of the best tools for vectorization with NumPy aka broadcasting
and replacing the math funcs with the NumPy equivalents ufuncs
, here's one vectorized solution -
# Get data as a Nx2 shaped NumPy array
data = np.array(df['coordinates'].tolist())
# Convert to radians
data = np.deg2rad(data)
# Extract col-1 and 2 as latitudes and longitudes
lat = data[:,0]
lng = data[:,1]
# Elementwise differentiations for lattitudes & longitudes
diff_lat = lat[:,None] - lat
diff_lng = lng[:,None] - lng
# Finally Calculate haversine
d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
return 2 * 6371 * np.arcsin(np.sqrt(d))
Runtime tests -
The other np.vectorize based solution
has shown some positive promise on performance improvement over the original code, so this section would compare the posted broadcasting based approach against that one.
Function definitions -
def vectotized_based(df):
haver_vec = np.vectorize(haversine, otypes=[np.int16])
return df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))
def broadcasting_based(df):
data = np.array(df['coordinates'].tolist())
data = np.deg2rad(data)
lat = data[:,0]
lng = data[:,1]
diff_lat = lat[:,None] - lat
diff_lng = lng[:,None] - lng
d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
return 2 * 6371 * np.arcsin(np.sqrt(d))
Timings -
In [123]: # Input
...: length = 500
...: d1 = np.random.uniform(-90, 90, length)
...: d2 = np.random.uniform(-180, 180, length)
...: coords = tuple(zip(d1, d2))
...: df = pd.DataFrame({'id':np.arange(length), 'coordinates':coords})
...:
In [124]: %timeit vectotized_based(df)
1 loops, best of 3: 1.12 s per loop
In [125]: %timeit broadcasting_based(df)
10 loops, best of 3: 68.7 ms per loop