Filter Pandas dataframe based on combination of two columns
The tuple comparison approach as outlined by @Vivek Kalyanarangan is the way to go but the speed can be significantly increased in case of large dataframes by utilizing the MultiIndex instead of using an apply function for tuple creation:
For example, in your case:
keep_tuples = [(1,2), (4,3)]
tuples_in_df = pd.MultiIndex.from_frame(df[["a","b"]])
df[tuples_in_df.isin(keep_tuples)]
This leads to ~5X speed improvement on a 1,000,000 X 2 sized df when compared to using apply function.
Use -
df[df[['a', 'b']].apply(tuple, axis=1).isin([(1,2), (4,3)])]
Output
a b
0 1 2
3 4 3
Explanation
df[['a', 'b']].apply(tuple, axis=1)
gives a series of tuples -
0 (1, 2)
1 (2, 3)
2 (4, 2)
3 (4, 3)
.isin([(1,2), (4,3)])
searches for the desired tuples and gives a boolean series
Another idea is to join the two columns (a
and b
) as strings, and check for 12
and 43
, i.e.
df[df.astype(str).sum(axis = 1).isin([12, 43])]
# a b
#0 1 2
#3 4 3