Counting Consecutive Dates by Group
Sample:
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [79, 79, 79, 79, 79, 79, 80, 80, 80, 80, 80, 80, 80],
                   'tDate': [pd.Timestamp('2019-07-12 00:00:00'),
                             pd.Timestamp('2019-07-13 00:00:00'),
                             pd.Timestamp('2019-07-18 00:00:00'),
                             pd.Timestamp('2019-07-19 00:00:00'),
                             pd.Timestamp('2019-07-20 00:00:00'),
                             pd.Timestamp('2019-08-03 00:00:00'),
                             pd.Timestamp('2019-06-21 00:00:00'),
                             pd.Timestamp('2019-06-22 00:00:00'),
                             pd.Timestamp('2019-07-18 00:00:00'),
                             pd.Timestamp('2019-07-19 00:00:00'),
                             pd.Timestamp('2019-07-26 00:00:00'),
                             pd.Timestamp('2019-08-02 00:00:00'),
                             pd.Timestamp('2019-08-03 00:00:00')],
                   'value': [397, 404, 405, 406, 408, 413, 397, 404, 405, 406, 408, 410, 413]})
print (df)
ID tDate value
0 79 2019-07-12 397
1 79 2019-07-13 404
2 79 2019-07-18 405
3 79 2019-07-19 406
4 79 2019-07-20 408
5 79 2019-08-03 413
6 80 2019-06-21 397
7 80 2019-06-22 404
8 80 2019-07-18 405
9 80 2019-07-19 406
10 80 2019-07-26 408
11 80 2019-08-02 410
12 80 2019-08-03 413
Solution:
a = df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d'))
s = (~a).cumsum()
df['consec_count'] = np.where(a.groupby(s).transform('any'),
                              df.groupby(s).cumcount().add(1), 0)
print (df)
ID tDate value consec_count
0 79 2019-07-12 397 1
1 79 2019-07-13 404 2
2 79 2019-07-18 405 1
3 79 2019-07-19 406 2
4 79 2019-07-20 408 3
5 79 2019-08-03 413 0
6 80 2019-06-21 397 1
7 80 2019-06-22 404 2
8 80 2019-07-18 405 1
9 80 2019-07-19 406 2
10 80 2019-07-26 408 0
11 80 2019-08-02 410 1
12 80 2019-08-03 413 2
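If the logic is needed more than once, or the allowed gap should be configurable, the same steps can be wrapped in a small helper. This is only a sketch built on the solution above; the name count_consecutive and the gap parameter are illustrative, not part of the original answer:
def count_consecutive(df, group_col='ID', date_col='tDate', gap='1D'):
    # rows that continue a run: per-group difference equals the allowed gap
    a = df.groupby(group_col)[date_col].diff().eq(pd.Timedelta(gap))
    # run identifiers - every False starts a new run
    s = (~a).cumsum()
    # counter within each run, zeroed for runs without any gap-sized step
    return np.where(a.groupby(s).transform('any'),
                    df.groupby(s).cumcount().add(1), 0)

df['consec_count'] = count_consecutive(df)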
Explanation:
First, create a mask by comparing the per-group differences from DataFrameGroupBy.diff against one day:
print (df.assign(diff=df.groupby('ID')['tDate'].diff(),
                 a=df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d'))))
ID tDate value diff a
0 79 2019-07-12 397 NaT False
1 79 2019-07-13 404 1 days True
2 79 2019-07-18 405 5 days False
3 79 2019-07-19 406 1 days True
4 79 2019-07-20 408 1 days True
5 79 2019-08-03 413 14 days False
6 80 2019-06-21 397 NaT False
7 80 2019-06-22 404 1 days True
8 80 2019-07-18 405 26 days False
9 80 2019-07-19 406 1 days True
10 80 2019-07-26 408 7 days False
11 80 2019-08-02 410 7 days False
12 80 2019-08-03 413 1 days True
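If you prefer comparing plain integers, an equivalent way to build the same mask (not the original code, but it produces identical values) uses the .dt.days accessor on the differences:
a_alt = df.groupby('ID')['tDate'].diff().dt.days.eq(1)
print (a_alt.equals(a))
True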
Next, create unique run identifiers s with Series.cumsum applied to the inverted mask (~); each False starts a new run:
print (df.assign(diff=df.groupby('ID')['tDate'].diff(),
                 a=df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
                 a_neg=~a,
                 s=(~a).cumsum()))
ID tDate value diff a a_neg s
0 79 2019-07-12 397 NaT False True 1
1 79 2019-07-13 404 1 days True False 1
2 79 2019-07-18 405 5 days False True 2
3 79 2019-07-19 406 1 days True False 2
4 79 2019-07-20 408 1 days True False 2
5 79 2019-08-03 413 14 days False True 3
6 80 2019-06-21 397 NaT False True 4
7 80 2019-06-22 404 1 days True False 4
8 80 2019-07-18 405 26 days False True 5
9 80 2019-07-19 406 1 days True False 5
10 80 2019-07-26 408 7 days False True 6
11 80 2019-08-02 410 7 days False True 7
12 80 2019-08-03 413 1 days True False 7
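The run-labelling trick is general: on any boolean Series, the cumulative sum of the inverted values increases by 1 at every False, so each False together with the Trues that follow it shares one label. A small standalone illustration with made-up values:
b = pd.Series([False, True, True, False, True, False])
print ((~b).cumsum().tolist())
[1, 1, 1, 2, 2, 3]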
Create a mask with GroupBy.transform and any to test whether each run s contains at least one True; if it does, all values of that run are set to True:
print (df.assign(diff=df.groupby('ID')['tDate'].diff(),
                 a=df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
                 a_neg=~a,
                 s=(~a).cumsum(),
                 mask=a.groupby(s).transform('any')))
ID tDate value consec_count diff a a_neg s mask
0 79 2019-07-12 397 1 NaT False True 1 True
1 79 2019-07-13 404 2 1 days True False 1 True
2 79 2019-07-18 405 1 5 days False True 2 True
3 79 2019-07-19 406 2 1 days True False 2 True
4 79 2019-07-20 408 3 1 days True False 2 True
5 79 2019-08-03 413 0 14 days False True 3 False
6 80 2019-06-21 397 1 NaT False True 4 True
7 80 2019-06-22 404 2 1 days True False 4 True
8 80 2019-07-18 405 1 26 days False True 5 True
9 80 2019-07-19 406 2 1 days True False 5 True
10 80 2019-07-26 408 0 7 days False True 6 False
11 80 2019-08-02 410 1 7 days False True 7 True
12 80 2019-08-03 413 2 1 days True False 7 True
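GroupBy.transform('any') broadcasts the per-group result back to the original shape, so a single True anywhere in a group marks every row of that group as True. A small standalone sketch with made-up values:
flag = pd.Series([False, True, False, False])
key = pd.Series([1, 1, 2, 2])
print (flag.groupby(key).transform('any').tolist())
[True, True, False, False]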
Create a counter per run s with GroupBy.cumcount:
print (df.assign(diff=df.groupby('ID')['tDate'].diff(),
                 a=df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
                 a_neg=~a,
                 s=(~a).cumsum(),
                 mask=a.groupby(s).transform('any'),
                 c=df.groupby(s).cumcount().add(1)))
ID tDate value consec_count diff a a_neg s mask c
0 79 2019-07-12 397 1 NaT False True 1 True 1
1 79 2019-07-13 404 2 1 days True False 1 True 2
2 79 2019-07-18 405 1 5 days False True 2 True 1
3 79 2019-07-19 406 2 1 days True False 2 True 2
4 79 2019-07-20 408 3 1 days True False 2 True 3
5 79 2019-08-03 413 0 14 days False True 3 False 1
6 80 2019-06-21 397 1 NaT False True 4 True 1
7 80 2019-06-22 404 2 1 days True False 4 True 2
8 80 2019-07-18 405 1 26 days False True 5 True 1
9 80 2019-07-19 406 2 1 days True False 5 True 2
10 80 2019-07-26 408 0 7 days False True 6 False 1
11 80 2019-08-02 410 1 7 days False True 7 True 1
12 80 2019-08-03 413 2 1 days True False 7 True 2
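GroupBy.cumcount numbers the rows within each group starting from 0, which is why .add(1) is used to start the counter at 1. A standalone sketch with made-up keys:
key = pd.Series([1, 1, 1, 2, 2])
print (key.groupby(key).cumcount().add(1).tolist())
[1, 2, 3, 1, 2]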
Finally, use numpy.where with the mask to set 0 where a run contains no one-day step:
print (df.assign(diff=df.groupby('ID')['tDate'].diff(),
                 a=df.groupby('ID')['tDate'].diff().eq(pd.Timedelta(1, unit='d')),
                 a_neg=~a,
                 s=(~a).cumsum(),
                 mask=a.groupby(s).transform('any'),
                 c=df.groupby(s).cumcount().add(1),
                 out=np.where(a.groupby(s).transform('any'),
                              df.groupby(s).cumcount().add(1), 0)))
ID tDate value consec_count diff a a_neg s mask c out
0 79 2019-07-12 397 1 NaT False True 1 True 1 1
1 79 2019-07-13 404 2 1 days True False 1 True 2 2
2 79 2019-07-18 405 1 5 days False True 2 True 1 1
3 79 2019-07-19 406 2 1 days True False 2 True 2 2
4 79 2019-07-20 408 3 1 days True False 2 True 3 3
5 79 2019-08-03 413 0 14 days False True 3 False 1 0
6 80 2019-06-21 397 1 NaT False True 4 True 1 1
7 80 2019-06-22 404 2 1 days True False 4 True 2 2
8 80 2019-07-18 405 1 26 days False True 5 True 1 1
9 80 2019-07-19 406 2 1 days True False 5 True 2 2
10 80 2019-07-26 408 0 7 days False True 6 False 1 0
11 80 2019-08-02 410 1 7 days False True 7 True 1 1
12 80 2019-08-03 413 2 1 days True False 7 True 2 2
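As a quick sanity check (not part of the original answer, and assuming a and s from the Solution section are still in scope), recomputing the array and comparing it against the consec_count column confirms they match:
out = np.where(a.groupby(s).transform('any'),
               df.groupby(s).cumcount().add(1), 0)
print ((df['consec_count'] == out).all())
True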