How to apply an operation on a vector with offsets
Prospective approach
Looking at your sample data :
In [64]: start_end
Out[64]:
         0         1         2
0   (1, 6)    (4, 5)   (6, 12)
1  (7, 10)  (11, 12)  (13, 19)
The intervals are indeed non-overlapping within each row, but not across the entire dataset.
Now, we have np.ufunc.reduceat that gives us the ufunc reduction for each slice :
ufunc(ar[indices[i]: indices[i + 1]])
as long as indices[i] < indices[i+1]. So, with ufunc.reduceat(ar, indices), we would get :
[ufunc(ar[indices[0]: indices[1]]), ufunc(ar[indices[1]: indices[2]]), ...]
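As a quick self-contained illustration of that behaviour (the array and indices below are made up purely for demonstration) :
import numpy as np
ar = np.array([1, 2, 3, 4, 5, 6])
# Reduces over the slices ar[0:2], ar[2:5] and ar[5:]
np.add.reduceat(ar, [0, 2, 5])    # -> array([ 3, 12,  6])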
In our case, for each tuple (x,y), we know x<y. With the stacked version, we have :
[(x1,y1), (x2,y2), (x3,y3), ...]
If we flatten, it would be :
[x1,y1,x2,y2,x3,y3, ...]
So, we might not have y1<x2, but that's okay, because we don't need the ufunc reduction for that pair, and similarly for the pair y2,x3. Those unwanted reductions can simply be skipped with a step-size slicing ([::2]) of the final output.
Thus, we would have :
# Inputs : a (1D array), start_end (2D array of shape (N,2))
lens = start_end[:,1]-start_end[:,0]
out = np.add.reduceat(a, start_end.ravel())[::2]/lens
The np.add.reduceat() part gives us the sliced summations. We then need the division by lens for the average computations.
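To make the [::2] step concrete, here is a small made-up illustration (names arr and pairs are invented, not the sample data) showing which reductions get discarded :
import numpy as np
arr = np.arange(10.)
pairs = np.array([[0, 3], [5, 7]])
raw = np.add.reduceat(arr, pairs.ravel())   # reduces over arr[0:3], arr[3:5], arr[5:7], arr[7:]
raw          # -> array([ 3.,  7., 11., 24.])
raw[::2]     # -> array([ 3., 11.]) : just the wanted slices arr[0:3] and arr[5:7]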
Sample run -
In [47]: a
Out[47]:
array([0.49264042, 0.00506412, 0.61419663, 0.77596769, 0.50721381,
0.76943416, 0.83570173, 0.2085408 , 0.38992344, 0.64348176,
0.3168665 , 0.78276451, 0.03779647, 0.33456905, 0.93971763,
0.49663649, 0.4060438 , 0.8711461 , 0.27630025, 0.17129342])
In [48]: start_end
Out[48]:
array([[ 1, 3],
[ 4, 5],
[ 6, 12],
[ 7, 10],
[11, 12],
[13, 19]])
In [49]: [np.mean(a[i:j]) for (i,j) in start_end]
Out[49]:
[0.30963037472653104,
0.5072138121177008,
0.5295464559328862,
0.41398199978967815,
0.7827645134019902,
0.5540688880441684]
In [50]: lens = start_end[:,1]-start_end[:,0]
...: out = np.add.reduceat(a, start_end.ravel())[::2]/lens
In [51]: out
Out[51]:
array([0.30963037, 0.50721381, 0.52954646, 0.413982 , 0.78276451,
0.55406889])
For completeness, referring back to the given sample, the conversion steps were :
# Given start_end as df and values as a 2D array
start_end = np.vstack(np.concatenate(start_end.values))
a = values.ravel()
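For reference, here is a self-contained sketch of those conversion steps end to end; the frame and values below are made up to mimic the question's layout (tuple cells, a 2D values array) :
import numpy as np
import pandas as pd
# Made-up frame with the same layout : each cell holds a (start, end) tuple
start_end_df = pd.DataFrame([[(1, 3), (4, 5), (6, 12)],
                             [(7, 10), (11, 12), (13, 19)]])
values = np.arange(20.).reshape(2, 10)                        # made-up 2D values array
start_end = np.vstack(np.concatenate(start_end_df.values))    # -> (6, 2) int array
a = values.ravel()                                            # -> 1D array of length 20
lens = start_end[:, 1] - start_end[:, 0]
out = np.add.reduceat(a, start_end.ravel())[::2] / lens       # per-interval means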
For other ufuncs that have a reduceat method, we would simply replace np.add.reduceat with the corresponding ufunc.reduceat, as sketched below.
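For instance, a per-slice maximum (assuming the same a and start_end as above) would look like this :
# Per-slice maximum instead of the mean; no division by lens needed here
out_max = np.maximum.reduceat(a, start_end.ravel())[::2]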
For computing the mean in your case, you will never go as fast as if you precompute cumulative sums first, using numpy.cumsum for instance: each interval sum then reduces to a single subtraction of two cumulative sums. Check out the following code:
import numpy as np
import pandas as pd
import time
R = 1_000
C = 10_000
M = 100
# Generation of test case
start = np.random.randint(0, M-1, (R*C,1))
end = np.random.randint(0, M-1, (R*C,1))
start = np.where(np.logical_and(start>=end, end>1), end-1, start)
end = np.where(np.logical_and(start>=end, start<M-1), start+1, end)
start_end = np.hstack((start, end))
values = np.random.rand(M)
t_start = time.time()
# Mean via np.add.reduceat (first approach)
lens = start_end[:,1]-start_end[:,0]
mean = np.add.reduceat(values, start_end.ravel())[::2]/lens
print('Timer 1:', time.time()-t_start, 's')
t_start = time.time()
# Cumulative sums, with a leading zero so intervals starting at index 0 work
cum_values = np.zeros((values.size+1,))
cum_values[1:] = np.cumsum(values)
# Mean via difference of cumulative sums (second approach)
mean_2 = (cum_values[start_end[:,1]]-cum_values[start_end[:,0]])/(start_end[:,1]-start_end[:,0])
print('Timer 2:', time.time()-t_start, 's')
print('Results are equal!' if np.allclose(mean, mean_2) else 'Results differ!')
print('Norm of the difference:', np.linalg.norm(mean - mean_2))
Output:
% python3 script.py
Timer 1: 0.48940515518188477 s
Timer 2: 0.16983389854431152 s
Results are equal!
Norm of the difference: 2.545241707481022e-12
The difference in performance gets even larger as M increases. For M=5000 you get:
% python3 script.py
Timer 1: 4.5356669425964355 s
Timer 2: 0.1772768497467041 s
Results are equal!
Norm of the difference: 1.0660592585125616e-10
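That gap is expected: the reduceat approach has to touch every element of every interval, so its cost grows with the interval lengths (and hence with M), while the cumulative-sum approach pays for one pass over values and then a single subtraction per interval. A minimal sketch of that difference-of-cumsums identity, with made-up numbers:
import numpy as np
vals = np.array([2., 4., 6., 8.])
cum_vals = np.zeros(vals.size + 1)
cum_vals[1:] = np.cumsum(vals)               # [0., 2., 6., 12., 20.]
# sum(vals[1:3]) = 4 + 6 = 10 falls out of one subtraction: cum_vals[3] - cum_vals[1]
mean_1_3 = (cum_vals[3] - cum_vals[1]) / (3 - 1)        # -> 5.0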