Efficient way to apply multiple filters to pandas DataFrame or Series
Chaining conditions creates long lines, which are discouraged by pep8. Using the .query method forces to use strings, which is powerful but unpythonic and not very dynamic.
Once each of the filters is in place, one approach is
import numpy as np
import functools
def conjunction(*conditions):
return functools.reduce(np.logical_and, conditions)
c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4
data_filtered = data[conjunction(c1,c2,c3)]
np.logical operates on and is fast, but does not take more than two arguments, which is handled by functools.reduce.
Note that this still has some redundancies: a) shortcutting does not happen on a global level b) Each of the individual conditions runs on the whole initial data. Still, I expect this to be efficient enough for many applications and it is very readable.
You can also make a disjunction (wherein only one of the conditions needs to be true) by using np.logical_or
instead:
import numpy as np
import functools
def disjunction(*conditions):
return functools.reduce(np.logical_or, conditions)
c_1 = data.col1 == True
c_2 = data.col2 < 64
c_3 = data.col3 != 4
data_filtered = data[disjunction(c_1,c_2,c_3)]
Simplest of All Solutions:
Use:
filtered_df = df[(df['col1'] >= 1) & (df['col1'] <= 5)]
Another Example, To filter the dataframe for values belonging to Feb-2018, use the below code
filtered_df = df[(df['year'] == 2018) & (df['month'] == 2)]
Pandas (and numpy) allow for boolean indexing, which will be much more efficient:
In [11]: df.loc[df['col1'] >= 1, 'col1']
Out[11]:
1 1
2 2
Name: col1
In [12]: df[df['col1'] >= 1]
Out[12]:
col1 col2
1 1 11
2 2 12
In [13]: df[(df['col1'] >= 1) & (df['col1'] <=1 )]
Out[13]:
col1 col2
1 1 11
If you want to write helper functions for this, consider something along these lines:
In [14]: def b(x, col, op, n):
return op(x[col],n)
In [15]: def f(x, *b):
return x[(np.logical_and(*b))]
In [16]: b1 = b(df, 'col1', ge, 1)
In [17]: b2 = b(df, 'col1', le, 1)
In [18]: f(df, b1, b2)
Out[18]:
col1 col2
1 1 11
Update: pandas 0.13 has a query method for these kind of use cases, assuming column names are valid identifiers the following works (and can be more efficient for large frames as it uses numexpr behind the scenes):
In [21]: df.query('col1 <= 1 & 1 <= col1')
Out[21]:
col1 col2
1 1 11