Python 3 pandas.groupby.filter
The short answer:
grouped.apply(lambda x: x[x['B'] == x['B']].min())
... and the longer one:
Your grouped
object has 2 groups:
In[25]: for df in grouped:
...: print(df)
...:
('bar',
A B C
1 bar 2 5.0
3 bar 4 1.0
5 bar 6 9.0)
('foo',
A B C
0 foo 1 2.0
2 foo 3 8.0
4 foo 5 2.0)
filter()
method for GroupBy object is for filtering groups as entities, NOT for filtering their individual rows. So using the filter()
method, you may obtain only 4 results:
- an empty DataFrame (0 rows),
- rows of the group 'bar' (3 rows),
- rows of the group 'foo' (3 rows),
- rows of both groups (6 rows)
Nothing else, regardless of the used parameter (boolean function) in the filter()
method.
So you have to use some other method. An appropriate one is the very flexible apply()
method, which lets you apply an arbitrary function which
- takes a DataFrame (a group of GroupBy object) as its only parameter,
- returns either a Pandas object or a scalar.
In your case that function should return (for every of your 2 groups) the 1-row DataFrame having the minimal value in the column 'B'
, so we will use the Boolean mask
group['B'] == group['B'].min()
for selecting such a row (or - maybe - more rows):
In[26]: def select_min_b(group):
...: return group[group['B'] == group['B'].min()]
Now using this function as a parameter of the apply()
method of GroupBy object grouped
we will obtain
In[27]: grouped.apply(select_min_b)
Out[27]:
A B C
A
bar 1 bar 2 5.0
foo 0 foo 1 2.0
Note:
The same, but as only one command (using the lambda
function):
grouped.apply(lambda group: group[group['B'] == group['B']].min())
>>> # sort=False to return the rows in the order they originally occurred
>>> df.loc[df.groupby("A", sort=False)["B"].idxmin()]
A B C
0 foo 1 2.0
1 bar 2 5.0
There's a fundamental difference: In the documentation example, there is a single Boolean
value per group. That is, you return the entire group if the mean is greater than 3. In your example, you want to filter specific rows within a group.
For your task the usual trick is to sort values and use .head
or .tail
to filter to the row with the smallest or largest value respectively:
df.sort_values('B').groupby('A').head(1)
# A B C
#0 foo 1 2.0
#1 bar 2 5.0
For more complicated queries you can use .transform
or .apply
to create a Boolean Series
to slice. Also in this case safer if multiple rows share the minimum and you need all of them:
df[df.groupby('A').B.transform(lambda x: x == x.min())]
# A B C
#0 foo 1 2.0
#1 bar 2 5.0
No need groupby
:-)
df.sort_values('B').drop_duplicates('A')
Out[288]:
A B C
0 foo 1 2.0
1 bar 2 5.0