How to split a pandas time-series by NAN values
You can use numpy.split
and then filter the resulting list. Here is one example assuming that the column with the values is labeled "value"
:
events = np.split(df, np.where(np.isnan(df.value))[0])
# removing NaN entries
events = [ev[~np.isnan(ev.value)] for ev in events if not isinstance(ev, np.ndarray)]
# removing empty DataFrames
events = [ev for ev in events if not ev.empty]
You will have a list with all the events separated by the NaN
values.
Note, this answer is for pandas<0.25.0, if you're using 0.25.0 or greater see this answer by thesofakillers
I found an efficient solution for very large and sparse datasets. In my case, hundreds of thousands of rows with only a dozen or so brief segments of data between NaN
values. I (ab)used the internals of pandas.SparseIndex
, which is a feature to help compress sparse datasets in memory.
Given some data:
import pandas as pd
import numpy as np
# 10 days at per-second resolution, starting at midnight Jan 1st, 2011
rng = pd.date_range('1/1/2011', periods=10 * 24 * 60 * 60, freq='S')
dense_ts = pd.Series(np.nan, index=rng, dtype=np.float64)
# Three blocks of non-null data throughout timeseries
dense_ts[500:510] = np.random.randn(10)
dense_ts[12000:12015] = np.random.randn(15)
dense_ts[20000:20050] = np.random.randn(50)
Which looks like:
2011-01-01 00:00:00 NaN
2011-01-01 00:00:01 NaN
2011-01-01 00:00:02 NaN
2011-01-01 00:00:03 NaN
..
2011-01-10 23:59:56 NaN
2011-01-10 23:59:57 NaN
2011-01-10 23:59:58 NaN
2011-01-10 23:59:59 NaN
Freq: S, Length: 864000, dtype: float64
We can find the blocks efficiently and easily:
# Convert to sparse then query index to find block locations
sparse_ts = dense_ts.to_sparse()
block_locs = zip(sparse_ts.sp_index.blocs, sparse_ts.sp_index.blengths)
# Map the sparse blocks back to the dense timeseries
blocks = [dense_ts.iloc[start:(start + length - 1)] for (start, length) in block_locs]
Voila:
[2011-01-01 00:08:20 0.531793
2011-01-01 00:08:21 0.484391
2011-01-01 00:08:22 0.022686
2011-01-01 00:08:23 -0.206495
2011-01-01 00:08:24 1.472209
2011-01-01 00:08:25 -1.261940
2011-01-01 00:08:26 -0.696388
2011-01-01 00:08:27 -0.219316
2011-01-01 00:08:28 -0.474840
Freq: S, dtype: float64, 2011-01-01 03:20:00 -0.147190
2011-01-01 03:20:01 0.299565
2011-01-01 03:20:02 -0.846878
2011-01-01 03:20:03 -0.100975
2011-01-01 03:20:04 1.288872
2011-01-01 03:20:05 -0.092474
2011-01-01 03:20:06 -0.214774
2011-01-01 03:20:07 -0.540479
2011-01-01 03:20:08 -0.661083
2011-01-01 03:20:09 1.129878
2011-01-01 03:20:10 0.791373
2011-01-01 03:20:11 0.119564
2011-01-01 03:20:12 0.345459
2011-01-01 03:20:13 -0.272132
Freq: S, dtype: float64, 2011-01-01 05:33:20 1.028268
2011-01-01 05:33:21 1.476468
2011-01-01 05:33:22 1.308881
2011-01-01 05:33:23 1.458202
2011-01-01 05:33:24 -0.874308
..
2011-01-01 05:34:02 0.941446
2011-01-01 05:34:03 -0.996767
2011-01-01 05:34:04 1.266660
2011-01-01 05:34:05 -0.391560
2011-01-01 05:34:06 1.498499
2011-01-01 05:34:07 -0.636908
2011-01-01 05:34:08 0.621681
Freq: S, dtype: float64]