How to apply "first" and "last" functions to columns while using group by in pandas?
I think the issue is that there are two different first methods which share a name but act differently: one is for groupby objects and the other is for a Series/DataFrame (to do with time series).
To replicate the behaviour of the groupby first method over a DataFrame using agg, you could use iloc[0] (which gets the first row in each group (DataFrame/Series) by position):
grouped.agg(lambda x: x.iloc[0])
For example:
In [1]: df = pd.DataFrame([[1, 2], [3, 4]])
In [2]: g = df.groupby(0)
In [3]: g.first()
Out[3]:
   1
0
1  2
3  4
In [4]: g.agg(lambda x: x.iloc[0])
Out[4]:
   1
0
1  2
3  4
Analogously, you can replicate last using iloc[-1].
Note: This also works column-wise, e.g.:
g.agg({1: lambda x: x.iloc[0]})
In older versions of pandas you would use the irow method (e.g. x.irow(0)); see previous edits.
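As a quick check of the points above on slightly larger data, here is a minimal sketch. The frame and its column names (key, a, b) are made up for illustration, and there are no missing values, so the iloc-based lambdas should agree with first and last:

```python
import pandas as pd

# Hypothetical data: two groups, two value columns, no missing values.
df = pd.DataFrame({
    "key": ["x", "x", "y", "y"],
    "a": [1, 2, 3, 4],
    "b": [10, 20, 30, 40],
})
g = df.groupby("key")

# Whole-frame aggregation: first / last row of each group by position.
first_rows = g.agg(lambda s: s.iloc[0])
last_rows = g.agg(lambda s: s.iloc[-1])

# Column-wise: a different function per column, passed as a dict.
mixed = g.agg({"a": lambda s: s.iloc[0], "b": lambda s: s.iloc[-1]})

print(first_rows)
print(last_rows)
print(mixed)
```

Because each lambda receives one column of one group as a Series, iloc[0] and iloc[-1] pick values purely by position within the group.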
A couple of updated notes:
This is better done using the nth groupby method, which is much faster in pandas >= 0.13:
g.nth(0) # first
g.nth(-1) # last
You have to take a little care, as the default behaviour of first and last is to ignore NaN values, and IIRC for DataFrame groupbys it was broken pre-0.13. There is a dropna option for nth.
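A small sketch of the NaN difference, using made-up data (columns A and B are hypothetical). Only the values are compared here, because the shape of what nth returns depends on the version: as far as I know, from pandas 2.0 it returns the selected rows with their original index rather than one row per group key.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "x" starts with NaN, group "y" ends with NaN.
df = pd.DataFrame({
    "A": ["x", "x", "y", "y"],
    "B": [np.nan, 1.0, 2.0, np.nan],
})
g = df.groupby("A")

# first/last skip NaN and return the first/last non-null value per group.
first_vals = g["B"].first().tolist()
last_vals = g["B"].last().tolist()

# nth is purely positional, so NaN is returned if it sits in that position.
nth_first = g.nth(0)["B"].tolist()
nth_last = g.nth(-1)["B"].tolist()

# With dropna="any", rows containing NaN are dropped before taking the nth row.
nth_first_dropna = g.nth(0, dropna="any")["B"].tolist()

print(first_vals, last_vals)
print(nth_first, nth_last)
print(nth_first_dropna)
```

So if you want "the first row, whatever it contains", nth(0) or iloc[0] is the safer choice; if you want "the first non-null value", use first.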
You can use the strings rather than built-ins (though IIRC pandas spots that it's the sum builtin and applies np.sum):
grouped['D'].agg({'result1' : "sum", 'result2' : "mean"})
Instead of using first or last, use their string representations in the agg method. For example, in the OP's case:
import numpy as np
grouped = df.groupby(['ColumnName'])
grouped['D'].agg({'result1' : np.sum, 'result2' : np.mean})
# you can use the string names for first and last in the same way
grouped['D'].agg({'result1' : 'first', 'result2' : 'last'})
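One caveat for newer pandas: as far as I know, passing a dict of new names to agg on a single column, as above, was deprecated in 0.20 and removed in 1.0. The keyword form below (named aggregation, available since 0.25) is a sketch of the equivalent. The names ColumnName, D, result1 and result2 follow the example above; the values are made up.

```python
import pandas as pd

# Hypothetical data shaped like the OP's case.
df = pd.DataFrame({
    "ColumnName": ["a", "a", "b", "b", "b"],
    "D": [1, 2, 3, 4, 5],
})
grouped = df.groupby("ColumnName")

# Each keyword becomes an output column; the value names the aggregation.
result = grouped["D"].agg(result1="first", result2="last")
print(result)
```

The same keyword form accepts other string names such as "sum" or "mean".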