4 ways to select rows from a DataFrame based on column values
There are several ways to select rows from a Pandas dataframe:
- Boolean indexing (DataFrame[DataFrame['col'] == value])
- Positional indexing (DataFrame.iloc[...])
- Label indexing (DataFrame.xs(...))
- DataFrame.query(...) API
Below I show you examples of each, with advice when to use certain techniques. Assume our criterion is column 'A' == 'bar'
Setup
The first thing we'll need is to identify a condition that will act as our criterion for selecting rows. We'll start with the OP's case column_name == some_value
, and include some other common use cases.
An example would be:
import pandas as pd, numpy as np
df = pd.DataFrame({
'A':
'Contrary bar popular bar Lorem bar Ipsum is not simply'.split(),
'B':
'Lorem Ipsum comes from sections one two three four five'.split(),
'C':
np.arange(10),
'D':
np.arange(10) * 2
})
print(df)
Output:
A B C D
0 Contrary Lorem 0 0
1 bar Ipsum 1 2
2 popular comes 2 4
3 bar from 3 6
4 Lorem sections 4 8
5 bar one 5 10
6 Ipsum two 6 12
7 is three 7 14
8 not four 8 16
9 simply five 9 18
Method 1: Boolean indexing (DataFrame[DataFrame['col'] == value] ) #
This is one of the simplest ways to accomplish this task and if performance or intuitiveness isn't an issue, this should be your chosen method. However, if performance is a concern, then you might want to consider an alternative way of creating the mask.
mask = df['A'] == 'bar'
print(mask)
Output:
A B C D
1 bar Ipsum 1 2
3 bar from 3 6
5 bar one 5 10
Method 2: Positional indexing (DataFrame.iloc[...]) #
Positional indexing (DataFrame.iloc[...]) has its use cases, but this isn't one of them. In order to identify where to slice, we first need to perform the same boolean analysis we did above. This leaves us performing one extra step to accomplish the same task.
mask = df['A'] == 'bar'
pos = np.flatnonzero(mask)
print(df.iloc[pos])
Output as same as the previous method:
A B C D
1 bar Ipsum 1 2
3 bar from 3 6
5 bar one 5 10
Method 3: Label indexing (DataFrame.xs(...)) #
Label indexing can be very handy, but in this case, we are again doing more work for no benefit
res = df.set_index('A', append=True, drop=False).xs('bar', level=1)
print(res)
Output as same as the previous methods:
A B C D
1 bar Ipsum 1 2
3 bar from 3 6
5 bar one 5 10
Method 4: DataFrame.query(...) API #
pd.DataFrame.query is a very elegant/intuitive way to perform this task, but is often slower. However, if you pay attention to the timings below, for large data, the query is very efficient. More so than the standard approach and of similar magnitude as my best suggestion.
res = df.query('A == "bar"')
print(res)
Output as same as the previous methods:
A B C D
1 bar Ipsum 1 2
3 bar from 3 6
5 bar one 5 10