How can I group by month from a date field using Python/pandas?
Try this:
In [6]: df['date'] = pd.to_datetime(df['date'])
In [7]: df
Out[7]:
        date  Revenue
0 2017-06-02      100
1 2017-05-23      200
2 2017-05-20      300
3 2017-06-22      400
4 2017-06-21      500
In [59]: df.groupby(df['date'].dt.strftime('%B'))['Revenue'].sum().sort_values()
Out[59]:
date
May      500
June    1000
Try a groupby using a pandas Grouper:
df = pd.DataFrame({'date':['6/2/2017','5/23/2017','5/20/2017','6/22/2017','6/21/2017'],'Revenue':[100,200,300,400,500]})
df.date = pd.to_datetime(df.date)
dg = df.groupby(pd.Grouper(key='date', freq='1M')).sum() # one group per calendar month; pandas >= 2.2 prefers the alias 'ME' over 'M'
dg.index = dg.index.strftime('%B')
      Revenue
May       500
June     1000
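One thing to watch with the Grouper approach: it bins by calendar month of each year, so it only matches the strftime('%B') result when all dates fall within one year. Here is a minimal sketch of the difference. The June 2018 row is made up for illustration, and passing pd.offsets.MonthEnd() instead of a string alias is my own choice to sidestep the 'M' vs 'ME' naming difference between pandas versions.

```python
import pandas as pd

# Hypothetical data: the original rows plus one made-up row from June 2018
df = pd.DataFrame({
    'date': pd.to_datetime(['2017-06-02', '2017-05-23', '2017-05-20',
                            '2017-06-22', '2017-06-21', '2018-06-15']),
    'Revenue': [100, 200, 300, 400, 500, 50],
})

# Grouper with a month-end offset: one bin per calendar month of each year.
# Months with no rows in between still appear, with a sum of 0.
per_month = df.groupby(pd.Grouper(key='date', freq=pd.offsets.MonthEnd()))['Revenue'].sum()

# Grouping by month name: June 2017 and June 2018 land in the same group
by_name = df.groupby(df['date'].dt.strftime('%B'))['Revenue'].sum()

print(per_month[per_month > 0])  # only the months that have revenue
print(by_name)
```

If you want totals per month name regardless of year, group on the name or on .dt.month. If you want a time series of monthly totals, Grouper is the better fit.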
For a DataFrame with many rows, strftime is slow. If the date column already has dtype datetime64[ns] (use pd.to_datetime() to convert, or pass parse_dates when importing a CSV), you can group directly on a datetime property such as .dt.month (Method 3). The speedup is substantial: in the timings below, about 34 ms versus 1.47 s for strftime.
import numpy as np
import pandas as pd
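# Daily timestamps from the Unix epoch to now, in a single column named 'date'
T = pd.date_range(pd.Timestamp(0), pd.Timestamp.now()).to_frame(index=False, name='date')
# Repeat the frame 9 times to get a larger benchmark table
T = pd.concat([T] * 9, ignore_index=True)
# One random revenue value per row
T['revenue'] = np.random.randint(1000, size=len(T))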
print(T.shape) #(159336, 2)
print(T.dtypes) #date: datetime64[ns], revenue: int32
Method 1: strftime
%timeit -n 10 -r 7 T.groupby(T['date'].dt.strftime('%B'))['revenue'].sum()
1.47 s ± 10.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Method 2: Grouper
%timeit -n 10 -r 7 T.groupby(pd.Grouper(key='date', freq='1M')).sum()
#NOTE Groups by calendar month of each year (month-end Timestamps as index), not by month name across years
56.9 ms ± 2.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Method 3: datetime properties
%timeit -n 10 -r 7 T.groupby(T['date'].dt.month)['revenue'].sum()
#NOTE The result is indexed by month number (1..12); map the integers to month names manually if needed
34 ms ± 3.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
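For Method 3, the manual mapping from month numbers to names could look like the sketch below. It reuses the small sample frame from the question and uses calendar.month_name from the standard library, which is one option among several and not part of the original answer. Note that calendar.month_name follows the current locale.

```python
import calendar
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'date': pd.to_datetime(['2017-06-02', '2017-05-23', '2017-05-20',
                            '2017-06-22', '2017-06-21']),
    'Revenue': [100, 200, 300, 400, 500],
})

# Group on the integer month; the result is sorted by month number (5, 6)
monthly = df.groupby(df['date'].dt.month)['Revenue'].sum()

# Replace the month numbers with month names after the aggregation
monthly.index = monthly.index.map(lambda m: calendar.month_name[m])

print(monthly)
```

Because the grouping happens on the month number, the rows stay in calendar order, and the string conversion only touches the (at most 12) group labels instead of every row.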