pandas multiple date ranges from column of dates
reindex
ing with pd.date_range
Let's try creating a flat list of date-ranges and reindexing this DataFrame.
from itertools import chain
v = df.assign(Date=pd.to_datetime(df.Date)).set_index('Date')
# assuming ID is a string column
v.reindex(chain.from_iterable(
pd.date_range(end=i, periods=5) for i in v.index)
).bfill().reset_index()
Date ID
0 2018-03-14 11
1 2018-03-15 11
2 2018-03-16 11
3 2018-03-17 11
4 2018-03-18 11
5 2018-03-19 11
6 2017-12-31 22
7 2018-01-01 22
8 2018-01-02 22
9 2018-01-03 22
10 2018-01-04 22
11 2018-01-05 22
12 2018-02-07 33
13 2018-02-08 33
14 2018-02-09 33
15 2018-02-10 33
16 2018-02-11 33
17 2018-02-12 33
concat
based solution on keys
Just for fun. My reindex
solution is definitely more performant and easier to read, so if you were to pick one, use that.
v = df.assign(Date=pd.to_datetime(df.Date))
v_dict = {
j : pd.DataFrame(
pd.date_range(end=i, periods=5), columns=['Date']
)
for j, i in zip(v.ID, v.Date)
}
(pd.concat(v_dict, axis=0)
.reset_index(level=1, drop=True)
.rename_axis('ID')
.reset_index()
)
ID Date
0 11 2018-03-14
1 11 2018-03-15
2 11 2018-03-16
3 11 2018-03-17
4 11 2018-03-18
5 11 2018-03-19
6 22 2017-12-31
7 22 2018-01-01
8 22 2018-01-02
9 22 2018-01-03
10 22 2018-01-04
11 22 2018-01-05
12 33 2018-02-07
13 33 2018-02-08
14 33 2018-02-09
15 33 2018-02-10
16 33 2018-02-11
17 33 2018-02-12
Here is another by using df.assign
to overwrite date
and pd.concat
to glue the range together. cᴏʟᴅsᴘᴇᴇᴅ's solution wins in performance but I think this might be a nice addition as it is quite easy to read and understand.
df = pd.concat([df.assign(Date=df.Date - pd.Timedelta(days=i)) for i in range(5)])
Alternative:
dates = (pd.date_range(*x) for x in zip(df['Date']-pd.Timedelta(days=4), df['Date']))
df = (pd.DataFrame(dict(zip(df['ID'],dates)))
.T
.stack()
.reset_index(0)
.rename(columns={'level_0': 'ID', 0: 'Date'}))
Full example:
import pandas as pd
data = '''\
ID Date
11 3/19/2018
22 1/5/2018
33 2/12/2018'''
# Recreate dataframe
df = pd.read_csv(pd.compat.StringIO(data), sep='\s+')
df['Date']= pd.to_datetime(df.Date)
df = pd.concat([df.assign(Date=df.Date - pd.Timedelta(days=i)) for i in range(5)])
df.sort_values(by=['ID','Date'], ascending = [True,True], inplace=True)
print(df)
Returns:
ID Date
0 11 2018-03-15
0 11 2018-03-16
0 11 2018-03-17
0 11 2018-03-18
0 11 2018-03-19
1 22 2018-01-01
1 22 2018-01-02
1 22 2018-01-03
1 22 2018-01-04
1 22 2018-01-05
2 33 2018-02-08
2 33 2018-02-09
2 33 2018-02-10
2 33 2018-02-11
2 33 2018-02-12
group by ID
, select the column Date
, and for each group generate a series of five days leading up to the greatest date.
rather than writing a long lambda, I've written a helper function.
def drange(x):
e = x.max()
s = e-pd.Timedelta(days=4)
return pd.Series(pd.date_range(s,e))
res = df.groupby('ID').Date.apply(drange)
Then drop the extraneous level from the resulting multiindex and we get our desired output
res.reset_index(level=0).reset_index(drop=True)
# outputs:
ID Date
0 11 2018-03-15
1 11 2018-03-16
2 11 2018-03-17
3 11 2018-03-18
4 11 2018-03-19
5 22 2018-01-01
6 22 2018-01-02
7 22 2018-01-03
8 22 2018-01-04
9 22 2018-01-05
10 33 2018-02-08
11 33 2018-02-09
12 33 2018-02-10
13 33 2018-02-11
14 33 2018-02-12
Compact alternative
# Help function to return Serie with daterange
func = lambda x: pd.date_range(x.iloc[0]-pd.Timedelta(days=4), x.iloc[0]).to_series()
res = df.groupby('ID').Date.apply(func).reset_index().drop('level_1',1)