pandas: how do I select first row in each GROUP BY group?
Generally if you want your data sorted in a groupby but it's not one of the columns which are going to be grouped on then it's better to sort
the df prior to performing groupby
:
In [5]:
df.sort_values('B').groupby('A').first()
Out[5]:
B
A
bar 1
foo 1
EdChum's answer may not always work as intended. Instead of first()
use nth(0)
.
The method first()
is affected by this bug that has gone unsolved for some years now. Instead of the expected behaviour, first()
returns the first element that is not missing in each column within each group i.e. it ignores NaN values. For example, say you had a third column with some missing values:
df = pd.DataFrame({'A' : ['foo', 'foo', 'bar', 'bar', 'bar'],
'B' : ['1', '2','2', '4', '1'],
'C' : [np.nan, 'X', 'Y', 'Y', 'Y']})
A B C
0 foo 1 NaN
2 foo 2 X
3 bar 2 Y
4 bar 4 Y
5 bar 1 Y
Using first()
here (after sorting, just like EdChum correctly assessed in their answer) will skip over the missing values (note how it is mixing up values from different rows):
df.sort_values('B').groupby('A').first()
B C
A
bar 1 Y
foo 1 X
The correct way to get the full row, including missing values, is to use nth(0)
, which performs the expected operation:
df.sort_values('B').groupby('A').nth(0)
B C
A
bar 1 Y
foo 1 NaN
For completeness, this bug also affects last()
, its correct substitute being nth(-1)
.
Posting this in an answer since it's too long for a comment. Not sure this is within the scope of the question but I think it's relevant to many people looking for this answer (like myself before writing this) and is extremely easy to miss.
The pandas groupby function could be used for what you want, but it's really meant for aggregation. This is a simple 'take the first' operation.
What you actually want is the pandas drop_duplicates function, which by default will return the first row. What you usually would consider the groupby key, you should pass as the subset= variable
df.drop_duplicates(subset='A')
Should do what you want.
Also, df.sort('A')
does not sort the DataFrame df, it returns a copy which is sorted. If you want to sort it, you have to add the inplace=True
parameter.
df.sort('A', inplace=True)
Here's an alternative approach using groupby().rank()
:
df[ df.groupby('A')['B'].rank() == 1 ]
A B
1 foo 1
6 bar 1
This gives you the same answer as @EdChum's for the OP's sample dataframe, but could give a different answer if you have any ties during the sort, for example, with data like this:
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
'B': ['2', '1', '1', '1'] })
In this case you have some options using the optional method
argument, depending on how you wish to handle sorting ties:
df[ df.groupby('A')['B'].rank(method='average') == 1 ] # the default
df[ df.groupby('A')['B'].rank(method='min') == 1 ]
df[ df.groupby('A')['B'].rank(method='first') == 1 ] # doesn't work, not sure why