python pandas: applying different aggregate functions to different columns
As of pandas 0.25, this is possible with a "Named aggregation".
In [79]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
....: 'height': [9.1, 6.0, 9.5, 34.0],
....: 'weight': [7.9, 7.5, 9.9, 198.0]})
....:
In [80]: animals
Out[80]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
In [82]: animals.groupby("kind").agg(
....: min_height=('height', 'min'),
....: max_height=('height', 'max'),
....: average_weight=('weight', np.mean),
....: )
....:
Out[82]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
The previously deprecated version follows:
You can pass a dictionary of dictionaries to .agg
mapping {column: {name: aggfunc}}
, for example
In [46]: df.head()
Out[46]:
Year qtr realgdp realcons realinvs realgovt realdpi cpi_u M1 \
0 1950 1 1610.5 1058.9 198.1 361.0 1186.1 70.6 110.20
1 1950 2 1658.8 1075.9 220.4 366.4 1178.1 71.4 111.75
2 1950 3 1723.0 1131.0 239.7 359.6 1196.5 73.2 112.95
3 1950 4 1753.9 1097.6 271.8 382.5 1210.0 74.9 113.93
4 1951 1 1773.5 1122.8 242.9 421.9 1207.9 77.3 115.08
tbilrate unemp pop infl realint
0 1.12 6.4 149.461 0.0000 0.0000
1 1.17 5.6 150.260 4.5071 -3.3404
2 1.23 4.6 151.064 9.9590 -8.7290
3 1.35 4.2 151.871 9.1834 -7.8301
4 1.40 3.5 152.393 12.6160 -11.2160
In [47]: df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
"unemp": {"mean_unemp": "mean"}})
Out[47]:
realgdp unemp
mean_gdp std_gdp mean_unemp
qtr
1 4506.439216 2104.195963 5.694118
2 4546.043137 2121.824090 5.686275
3 4580.507843 2132.897955 5.662745
4 4617.592157 2158.132698 5.654902
The result has a MultiIndex in the columns. If you don't want that outer level, you can use .columns.droplevel(0)
.
I agree this is a bit frustrating butI do find chaining with a rename
method served my purpose. Also, when it gets really complex, I will just reset the column names. It is a MultiIndex so it is immutable, and you should feel comfortable dealing with levels.
Based on the pandas documentation
The resulting aggregations are named for the functions themselves. If you need to rename, then you can add in a chained operation for a Series like this
In [67]: (grouped['C'].agg([np.sum, np.mean, np.std])
....: .rename(columns={'sum': 'foo',
....: 'mean': 'bar',
....: 'std': 'baz'})
....: )
....:
Out[67]:
foo bar baz
A
bar 0.392940 0.130980 0.181231
foo -1.796421 -0.359284 0.912265
When there are multiples uses of one function and you want to name it differently, this question of dropping the level and joining the different levels by underscore will help.
If you do find the sql syntax cleaner, there is a library called pandasql that give you this flexibility.