Pandas GroupBy and select rows with the minimum value in a specific column
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
   A  B   C
2  1  2  10
4  2  4   4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
   A  B   C
0  1  2  10
1  2  4   4
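For anyone who wants to run this end to end, here is a minimal sketch. The sample frame is an assumption on my part (the thread does not show the input); it is simply chosen so that it reproduces the output above.

```python
import pandas as pd

# Hypothetical sample data, picked to reproduce the output shown above.
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2],
    "B": [4, 5, 2, 7, 4, 6],
    "C": [3, 4, 10, 2, 4, 6],
})

# For each group of A, idxmin gives the index label of the row with the smallest B.
min_labels = df.groupby("A")["B"].idxmin()

# .loc uses those labels to pull the full rows out of the original frame.
result = df.loc[min_labels]
print(result)

# Optionally drop the original index labels and renumber from 0.
result_reset = result.reset_index(drop=True)
print(result_reset)
```

Note that when several rows in a group tie for the minimum, idxmin returns the label of the first one, so you always get exactly one row per group.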
I had a similar situation, but with a column name containing a space (e.g. "B val"). Attribute access (.B) does not work for such names, so bracket indexing is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
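A small sketch of the bracket form, using a made-up frame (the column values are purely illustrative):

```python
import pandas as pd

# Hypothetical data with a column name that is not a valid Python identifier.
df = pd.DataFrame({
    "A": [1, 1, 2, 2],
    "B val": [3, 1, 5, 2],
    "C": [10, 20, 30, 40],
})

# Bracket indexing works for any column name, including ones with spaces.
result = df.loc[df.groupby("A")["B val"].idxmin()]
print(result)
```

The bracket form also works for ordinary names like 'B', so it is a safe default.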
The accepted answer (suggesting idxmin) is awkward to use with the pipe pattern, because the expression has to refer to the DataFrame twice. A pipe-friendly alternative is to first sort the values and then use groupby with DataFrame.head:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
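The order-preservation claim is easy to check on a small frame. This is only a sketch: the data is hypothetical, and I use the built-in GroupBy.head in place of apply(DataFrame.head, n=1), since as far as I know newer pandas releases have changed how apply treats the grouping column.

```python
import pandas as pd

# Hypothetical sample data (same shape as the idxmin example in this thread).
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2],
    "B": [4, 5, 2, 7, 4, 6],
    "C": [3, 4, 10, 2, 4, 6],
})

# Sort once by B; mergesort is a stable sort, so equal B values keep their original order.
ordered = df.sort_values("B", kind="mergesort")

# Within each group the rows are still in ascending B order after groupby.
for key, group in ordered.groupby("A"):
    print(key, group["B"].tolist())

# Because of that, the first row of each group is the row with the smallest B.
first_rows = ordered.groupby("A").head(1)
print(first_rows)
```

GroupBy.head keeps the original index labels and the grouping column, so the result looks like the idxmin one, except that rows come out in sorted-by-B order rather than grouped by A.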
This approach has additional benefits:
- it can easily be extended to select the n rows with the smallest values in a specific column
- it can break ties by passing additional columns (as a list) to .sort_values(), e.g.:
  data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
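Both benefits can be sketched on a toy grade table. Everything here except the column names final_score, midterm_score and year is made up for illustration, and GroupBy.head again stands in for apply(DataFrame.head, n=...).

```python
import pandas as pd

# Hypothetical grade data; students "a" and "b" tie on final_score in 2020.
data = pd.DataFrame({
    "year": [2020, 2020, 2020, 2021, 2021],
    "student": ["a", "b", "c", "d", "e"],
    "final_score": [70, 70, 85, 60, 90],
    "midterm_score": [80, 65, 90, 75, 50],
})

# Sort by final_score, using midterm_score to break ties.
ordered = data.sort_values(["final_score", "midterm_score"])

# Lowest-scoring student per year; the 2020 tie is resolved by midterm_score.
lowest = ordered.groupby("year").head(1)
print(lowest)

# The two lowest-scoring students per year: only n changes.
two_lowest = ordered.groupby("year").head(2)
print(two_lowest)
```

With idxmin you would need extra work for either case, since it only ever returns one label per group and has no notion of a secondary column.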
As with the other answers, .reset_index(drop=True) is needed to match the result requested in the question exactly, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
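As a sanity check, here is a sketch that wraps the sort-based approach in a function usable with .pipe and compares it with the idxmin result. The helper name min_rows_by_sort and the sample data are my own assumptions, and it uses GroupBy.head rather than apply(DataFrame.head, n=1).

```python
import pandas as pd

# Hypothetical sample data (same as in the idxmin example in this thread).
df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2],
    "B": [4, 5, 2, 7, 4, 6],
    "C": [3, 4, 10, 2, 4, 6],
})


def min_rows_by_sort(frame, group_col, value_col):
    """Hypothetical helper: one row per group, the one with the smallest value_col."""
    return (
        frame.sort_values(value_col)   # smallest values first
        .groupby(group_col)
        .head(1)                       # first row of each group
        .sort_values(group_col)        # order the result by group key
        .reset_index(drop=True)
    )


# Pipe-friendly: the frame is only mentioned once in the chain.
via_sort = df.pipe(min_rows_by_sort, "A", "B")

# The idxmin approach, for comparison.
via_idxmin = df.loc[df.groupby("A")["B"].idxmin()].reset_index(drop=True)

print(via_sort.equals(via_idxmin))
```

On this data the two results agree. If a group has several rows tied for the minimum, the two approaches are not guaranteed to pick the same one unless you sort with a stable algorithm or add a tie-breaking column.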