How to efficiently compute a rolling unique count in a pandas time series?
I had two errors in the fast method `windowed_nunique`, now corrected in `windowed_nunique_corrected` below:
- The array for memoizing the count of each person ID within the window, `pid_cts`, was too small; it must span the full range of person IDs, i.e. have length `np.max(pids) + 1`.
- Because the leading and trailing edges of the window include integer days, `date_min` should be updated when `(date - date_min + 1) > window` (e.g. a window of 3 ending on day 2 includes days 0, 1, and 2).
Related links:
- Source Jupyter Notebook updated with solution: https://gist.github.com/stharrold/17589e6809d249942debe3a5c43d38cc
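For context, the docstring below notes that the fast function matches `pandas.core.window.Rolling` with a time-series alias offset. The `ref` array asserted against in In [17] is computed earlier in the notebook and not shown in this excerpt, but a slow pandas baseline of that form could look like the following sketch (assuming `df` has a monotonically increasing datetime column 'Date', a 'PersonId' column, and `window` is the width in days; `rolling_nunique_ref` is a hypothetical helper, not from the original notebook):
import numpy as np
import pandas as pd

def rolling_nunique_ref(df, window):
    """Slow pandas baseline: unique PersonIds per time-offset rolling window."""
    return (df.rolling(window='{:d}D'.format(window), on='Date')['PersonId']
              .apply(lambda s: s.nunique(), raw=False)
              .astype(np.int64)
              .values)

# Hypothetical usage: ref = rolling_nunique_ref(df, window)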
In [14]:
# Define a custom function and implement a just-in-time compiler.
import numba
import numpy as np

@numba.jit(nopython=True)
def windowed_nunique_corrected(dates, pids, window):
    r"""Track number of unique persons in window,
    reading through arrays only once.

    Args:
        dates (numpy.ndarray): Array of dates as number of days since epoch.
        pids (numpy.ndarray): Array of integer person identifiers.
            Required: min(pids) >= 0
        window (int): Width of window in units of difference of `dates`.
            Required: window >= 1

    Returns:
        ucts (numpy.ndarray): Array of unique counts.

    Raises:
        AssertionError: Raised if not...
            * len(dates) == len(pids)
            * min(pids) >= 0
            * window >= 1

    Notes:
        * Matches `pandas.core.window.Rolling`
          with a time series alias offset.

    """
    # Check arguments.
    assert len(dates) == len(pids)
    assert np.min(pids) >= 0
    assert window >= 1
    # Initialize counters.
    idx_min = 0
    idx_max = dates.shape[0]
    date_min = dates[idx_min]
    pid_min = pids[idx_min]
    pid_max = np.max(pids) + 1
    pid_cts = np.zeros(pid_max, dtype=np.int64)
    pid_cts[pid_min] = 1
    uct = 1
    ucts = np.zeros(idx_max, dtype=np.int64)
    ucts[idx_min] = uct
    idx = 1
    # For each (date, person)...
    while idx < idx_max:
        # Look up date, person.
        date = dates[idx]
        pid = pids[idx]
        # If the person's count went from 0 to 1, increment the unique person count.
        pid_cts[pid] += 1
        if pid_cts[pid] == 1:
            uct += 1
        # Drop past dates that fall outside of the window.
        # Note: If window=3, the window includes day 0, day 1, and day 2.
        while (date - date_min + 1) > window:
            # If the person's count went from 1 to 0, decrement the unique person count.
            pid_cts[pid_min] -= 1
            if pid_cts[pid_min] == 0:
                uct -= 1
            idx_min += 1
            date_min = dates[idx_min]
            pid_min = pids[idx_min]
        # Record the unique person count.
        ucts[idx] = uct
        idx += 1
    return ucts
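As a quick sanity check (toy data made up for illustration, not from the original notebook), a 2-day window counts the unique person IDs over the current and previous day:
# Toy example (hypothetical data): window=2 covers the current day and the previous day.
toy_dates = np.array([0, 0, 1, 2, 2, 3], dtype=np.int64)
toy_pids = np.array([1, 2, 1, 3, 1, 2], dtype=np.int64)
windowed_nunique_corrected(dates=toy_dates, pids=toy_pids, window=2)
# Expected result: array([1, 2, 2, 2, 2, 3])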
In [15]:
# Cast dates to integer days since the epoch.
df['DateEpoch'] = (df['Date'] - pd.to_datetime('1970-01-01')) / pd.to_timedelta(1, unit='D')
df['DateEpoch'] = df['DateEpoch'].astype(int)
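If 'Date' holds date-only timestamps, an equivalent conversion (just a sketch of an alternative, not part of the original notebook) truncates to day resolution via NumPy:
# Equivalent conversion (sketch): truncate datetime64[ns] to integer days since epoch.
df['DateEpoch'] = df['Date'].values.astype('datetime64[D]').astype(np.int64)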
In [16]:
%%timeit
windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
98.8 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [17]:
# Check accuracy of results.
test = windowed_nunique_corrected(
    dates=df['DateEpoch'].values,
    pids=df['PersonId'].values,
    window=window)
assert all(ref == test)
Very close to your time in seed test two, but as a one-liner, resampled over a year:
df.resample('AS', on='Date')['PersonId'].expanding(0).apply(lambda x: np.unique(x).shape[0])
Time results
1 loop, best of 3: 483 ms per loop