Find group of consecutive dates in Pandas DataFrame

It seems like you need two boolean masks: one to determine the breaks between groups, and one to determine which dates are in a group in the first place.

There's also one tricky part that can be fleshed out by example. Notice that df below contains an added row that doesn't have any consecutive dates before or after it.

>>> df
  DateAnalyzed       Val
1   2018-03-18  0.470253
2   2018-03-19  0.470253
3   2018-03-20  0.470253
4   2017-01-20  0.485949  # < watch out for this
5   2018-09-25  0.467729
6   2018-09-26  0.467729
7   2018-09-27  0.467729

>>> df.dtypes
DateAnalyzed    datetime64[ns]
Val                    float64
dtype: object

The answer below assumes that you want to ignore 2017-01-20 completely, without processing it. (See end of answer for a solution if you do want to process this date.)

First:

>>> dt = df['DateAnalyzed']
>>> day = pd.Timedelta('1d')
>>> in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
>>> in_block
1     True
2     True
3     True
4    False
5     True
6     True
7     True
Name: DateAnalyzed, dtype: bool

Now, in_block will tell you which dates are in a "consecutive" block, but it won't tell you to which groups each date belongs.

The next step is to derive the groupings themselves:

>>> filt = df.loc[in_block]
>>> breaks = filt['DateAnalyzed'].diff() != day
>>> groups = breaks.cumsum()
>>> groups
1    1
2    1
3    1
5    2
6    2
7    2
Name: DateAnalyzed, dtype: int64

Then you can call df.groupby(groups) with your operation of choice.

>>> for _, frame in filt.groupby(groups):
...     print(frame, end='\n\n')
... 
  DateAnalyzed       Val
1   2018-03-18  0.470253
2   2018-03-19  0.470253
3   2018-03-20  0.470253

  DateAnalyzed       Val
5   2018-09-25  0.467729
6   2018-09-26  0.467729
7   2018-09-27  0.467729

To incorporate this back into df, assign to it and the isolated dates will be NaN:

>>> df['groups'] = groups
>>> df
  DateAnalyzed       Val  groups
1   2018-03-18  0.470253     1.0
2   2018-03-19  0.470253     1.0
3   2018-03-20  0.470253     1.0
4   2017-01-20  0.485949     NaN
5   2018-09-25  0.467729     2.0
6   2018-09-26  0.467729     2.0
7   2018-09-27  0.467729     2.0

If you do want to include the "lone" date, things become a bit more straightforward:

dt = df['DateAnalyzed']
day = pd.Timedelta('1d')
breaks = dt.diff() != day
groups = breaks.cumsum()

There were similar questions after this one here and here, with more specific outputs requirements. Since this one is more general, I would like to contribute here as well.

We can easily assign an unique identifier to consecutive groups with one-line code:

df['grp_date'] = df.DateAnalyzed.diff().dt.days.ne(1).cumsum()

Here, every time we see a date with a difference greater than a day, we add a value to that date, otherwise it remains with the previous value so that we end up with a unique identifier per group.

See the output:

  DateAnalyzed       Val  grp_date
1   2018-03-18  0.470253         1
2   2018-03-19  0.470253         1
3   2018-03-20  0.470253         1
4   2018-09-25  0.467729         2
5   2018-09-26  0.467729         2
6   2018-09-27  0.467729         2

Now, it's easy to groupby "grp_date" and do whatever you wanna do with apply or agg.

Examples:

# Sum across consecutive days (or any other method from pandas groupby)
df.groupby('grp_date').sum()

# Get the first value and last value per consecutive days
df.groupby('grp_date').apply(lambda x: x.iloc[[0, -1]])
# or df.groupby('grp_date').head(n) for first n days

# Perform custom operation across target-columns
df.groupby('grp_date').apply(lambda x: (x['col1'] + x['col2']) / x['Val'].mean())

# Multiple operations for a target-column
df.groupby('grp_date').Val.agg(['min', 'max', 'mean', 'std'])

# and so on...

Find group of consecutive dates in Pandas DataFrame

Tags:

Python

Pandas

Datetime

Related

Recent Posts