Efficiently find overlap of date-time ranges from 2 dataframes
Based on timeit
tests, with 100 executions each, the namedtuple
approach in the question averaged 15.7314
seconds on my machine, vs. an average of 1.4794
seconds with this approach:
# determine the duration of the events in df2, in seconds
duration = (df2.datetime_end - df2.datetime_start).dt.seconds.values
# create a numpy array with one timestamp for each second
# in which an event occurred
seconds_range = np.repeat(df2.datetime_start.values, duration) + \
np.concatenate(map(np.arange, duration)) * pd.Timedelta('1S')
df1.merge(pd.DataFrame({'datetime_start':seconds_range,
'catg':np.repeat(df2.catg, duration)}). \
groupby(['catg', pd.Grouper(key='datetime_start', freq='30min')]). \
size(). \
unstack(level=0). \
reset_index(),
how="left")
# datetime_end datetime_start a b c d
# 0 2016-09-11 06:30:00 2016-09-11 06:00:00 NaN NaN NaN NaN
# 1 2016-09-11 07:30:00 2016-09-11 07:00:00 NaN NaN NaN NaN
# 2 2016-09-11 08:00:00 2016-09-11 07:30:00 NaN NaN NaN NaN
# 3 2016-09-11 08:30:00 2016-09-11 08:00:00 NaN NaN NaN NaN
# 4 2016-09-11 09:00:00 2016-09-11 08:30:00 687.0 NaN NaN NaN
# 5 2016-09-11 09:30:00 2016-09-11 09:00:00 1800.0 NaN NaN NaN
# 6 2016-09-11 10:00:00 2016-09-11 09:30:00 1048.0 NaN NaN NaN
# 7 2016-09-11 11:00:00 2016-09-11 10:30:00 NaN NaN NaN NaN
# 8 2016-09-11 14:30:00 2016-09-11 14:00:00 NaN 463.0 NaN 701.0
# 9 2016-09-11 15:00:00 2016-09-11 14:30:00 NaN 220.0 NaN NaN
# 10 2016-09-11 15:30:00 2016-09-11 15:00:00 NaN 300.0 NaN 1277.0
# 11 2016-09-11 16:00:00 2016-09-11 15:30:00 1316.0 NaN NaN 89.0
# 12 2016-09-11 16:30:00 2016-09-11 16:00:00 564.0 680.0 NaN NaN
# 13 2016-09-11 17:00:00 2016-09-11 16:30:00 NaN 1654.0 NaN NaN
# 14 2016-09-11 17:30:00 2016-09-11 17:00:00 NaN 389.0 20.0 NaN
Assuming both df1
and df2
are sorted in ascending order by the datetime_start
column (it appears so), then you just need to go through each row of the two dataframes once, resulting in an O(n)
running time, rather than the current O(n^2)
due to pairwise row comparison.
The following code illustrates the idea. The key point is using iterators it1
and it2
to point to the current row. Since the dataframes are sorted, if row1 is already later in time than row2, we are sure that the next row in df1 is later than row2. Harder to explain in words than code:
def main(df1, df2):
for cat in df2.catg.unique().tolist():
df1[cat] = 0
it1 = df1.iterrows()
it2 = df2.iterrows()
idx1, row1 = next(it1)
idx2, row2 = next(it2)
while True:
try:
r1 = Range(start=row1.datetime_start, end=row1.datetime_end)
r2 = Range(start=row2.datetime_start, end=row2.datetime_end)
if r2.end < r1.start:
# no overlap. r2 before r1. advance it2
idx2, row2 = next(it2)
elif r1.end < r2.start:
# no overlap. r1 before r2. advance it1
idx1, row1 = next(it1)
else:
# overlap. overlap(row1, row2) must > 0
df1.loc[idx1, row2.catg] += overlap(row1, row2)
# determine whether to advance it1 or it2
if r1.end < r2.end:
# advance it1
idx1, row1 = next(it1)
else:
# advance it2
idx2, row2 = next(it2)
except StopIteration:
break
main(df1, df2)
You should see a significant (~8x in my testing) performance improvement through a few changes. The structure of your code remains the same:
def overlap(row1, row2):
return max(0, (min(row1[0], row2[0]) - max(row1[1], row2[1])) / np.timedelta64(1, 's'))
df1 = df1.join(pd.DataFrame(dict.fromkeys(df2.catg.unique(), 0), index=df1.index))
for idx1, row1 in enumerate(df1.iloc[:, :2].values):
for catg, row2 in zip(df2['catg'], df2.iloc[:, 1:3].values):
df1.iat[idx1, df1.columns.get_loc(catg)] += overlap(row1, row2)
You can get this down further via numba
, or doing some clever pandas
stuff which hides all your logic.
Explanation
- Use
df.itertuples
instead ofdf.iterrows
- Use
df.iat
instead ofdf.loc
- Use
numpy
instead ofpandas
time objects - Remove named tuple creation
- Remove duplicate overlap calculation
- Improve overlap algorithm
Result
datetime_end datetime_start a b c d
0 2016-09-11 06:30:00 2016-09-11 06:00:00 0 0 0 0
1 2016-09-11 07:30:00 2016-09-11 07:00:00 0 0 0 0
2 2016-09-11 08:00:00 2016-09-11 07:30:00 0 0 0 0
3 2016-09-11 08:30:00 2016-09-11 08:00:00 0 0 0 0
4 2016-09-11 09:00:00 2016-09-11 08:30:00 687 0 0 0
5 2016-09-11 09:30:00 2016-09-11 09:00:00 1800 0 0 0
6 2016-09-11 10:00:00 2016-09-11 09:30:00 1048 0 0 0
7 2016-09-11 11:00:00 2016-09-11 10:30:00 0 0 0 0
13 2016-09-11 14:30:00 2016-09-11 14:00:00 0 463 0 701
14 2016-09-11 15:00:00 2016-09-11 14:30:00 0 220 0 0
15 2016-09-11 15:30:00 2016-09-11 15:00:00 0 300 0 1277
16 2016-09-11 16:00:00 2016-09-11 15:30:00 1316 0 0 89
17 2016-09-11 16:30:00 2016-09-11 16:00:00 564 680 0 0
18 2016-09-11 17:00:00 2016-09-11 16:30:00 0 1654 0 0
19 2016-09-11 17:30:00 2016-09-11 17:00:00 0 389 20 0