Select rows containing certain values from pandas dataframe
You can create a boolean mask from comparing the entire df against your string and call dropna
passing param how='all'
to drop rows where your string doesn't appear in all cols:
In [59]:
df[df == 'banana'].dropna(how='all')
Out[59]:
A B C
1 NaN banana NaN
3 banana NaN NaN
To test for multiple values you can use multiple masks:
In [90]:
banana = df[(df=='banana')].dropna(how='all')
banana
Out[90]:
A B C
1 NaN banana NaN
3 banana NaN NaN
In [91]:
apple = df[(df=='apple')].dropna(how='all')
apple
Out[91]:
A B C
1 apple NaN NaN
2 NaN NaN apple
4 apple apple NaN
You can use index.intersection
to index just the common index values:
In [93]:
df.loc[apple.index.intersection(banana.index)]
Out[93]:
A B C
1 apple banana pear
Introduction
At the heart of selecting rows, we would need a 1D mask or a pandas-series of boolean elements of length same as length of df
, let's call it mask
. So, finally with df[mask]
, we would get the selected rows off df
following boolean-indexing.
Here's our starting df
:
In [42]: df
Out[42]:
A B C
1 apple banana pear
2 pear pear apple
3 banana pear pear
4 apple apple pear
I. Match one string
Now, if we need to match just one string, it's straight-foward with elementwise equality :
In [42]: df == 'banana'
Out[42]:
A B C
1 False True False
2 False False False
3 True False False
4 False False False
If we need to look ANY
one match in each row, use .any
method :
In [43]: (df == 'banana').any(axis=1)
Out[43]:
1 True
2 False
3 True
4 False
dtype: bool
To select corresponding rows :
In [44]: df[(df == 'banana').any(axis=1)]
Out[44]:
A B C
1 apple banana pear
3 banana pear pear
II. Match multiple strings
1. Search for ANY
match
Here's our starting df
:
In [42]: df
Out[42]:
A B C
1 apple banana pear
2 pear pear apple
3 banana pear pear
4 apple apple pear
NumPy's np.isin
would work here (or use pandas.isin as listed in other posts) to get all matches from the list of search strings in df
. So, say we are looking for 'pear'
or 'apple'
in df
:
In [51]: np.isin(df, ['pear','apple'])
Out[51]:
array([[ True, False, True],
[ True, True, True],
[False, True, True],
[ True, True, True]])
# ANY match along each row
In [52]: np.isin(df, ['pear','apple']).any(axis=1)
Out[52]: array([ True, True, True, True])
# Select corresponding rows with masking
In [56]: df[np.isin(df, ['pear','apple']).any(axis=1)]
Out[56]:
A B C
1 apple banana pear
2 pear pear apple
3 banana pear pear
4 apple apple pear
2. Search for ALL
match
Here's our starting df
again :
In [42]: df
Out[42]:
A B C
1 apple banana pear
2 pear pear apple
3 banana pear pear
4 apple apple pear
So, now we are looking for rows that have BOTH
say ['pear','apple']
. We will make use of NumPy-broadcasting
:
In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
Out[66]:
array([[ True, True],
[ True, True],
[ True, False],
[ True, True]])
So, we have a search list of 2
items and hence we have a 2D mask with number of rows = len(df)
and number of cols = number of search items
. Thus, in the above result, we have the first col for 'pear'
and second one for 'apple'
.
To make things concrete, let's get a mask for three items ['apple','banana', 'pear']
:
In [62]: np.equal.outer(df.to_numpy(copy=False), ['apple','banana', 'pear']).any(axis=1)
Out[62]:
array([[ True, True, True],
[ True, False, True],
[False, True, True],
[ True, False, True]])
The columns of this mask are for 'apple','banana', 'pear'
respectively.
Back to 2
search items case, we had earlier :
In [66]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1)
Out[66]:
array([[ True, True],
[ True, True],
[ True, False],
[ True, True]])
Since, we are looking for ALL
matches in each row :
In [67]: np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)
Out[67]: array([ True, True, False, True])
Finally, select rows :
In [70]: df[np.equal.outer(df.to_numpy(copy=False), ['pear','apple']).any(axis=1).all(axis=1)]
Out[70]:
A B C
1 apple banana pear
2 pear pear apple
4 apple apple pear
For single search value
df[df.values == "banana"]
or
df[df.isin(['banana'])]
For multiple search terms:
df[(df.values == "banana")|(df.values == "apple" ) ]
or
df[df.isin(['banana', "apple"])]
# A B C
# 1 apple banana NaN
# 2 NaN NaN apple
# 3 banana NaN NaN
# 4 apple apple NaN
From Divakar: lines with both are returned.
select_rows(df,['apple','banana'])
# A B C
# 0 apple banana pear