Multiple condition filter on dataframe
TL;DR To pass multiple conditions to filter or where, use Column objects and logical operators (&, |, ~). See Pyspark: multiple conditions in when clause.
df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))
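For a self-contained sketch (the SparkSession setup and the sample rows below are purely illustrative; your df will come from your own source):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with ISO-formatted date strings
df = spark.createDataFrame(
    [("2016-09-15",), ("2016-12-01",), ("2017-05-20",)],
    ["act_date"],
)

# Each comparison is wrapped in parentheses because & binds more
# tightly than >= and <= in Python
df.filter(
    (col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01")
).show()
# Keeps only the 2016-12-01 row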
You can also use a single SQL string:
df.filter("act_date >='2016-10-01' AND act_date <='2017-04-01'")
In practice it makes more sense to use between:
df.filter(col("act_date").between("2016-10-01", "2017-04-01"))
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'")
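Note that both forms of between are inclusive on both endpoints. Continuing the hypothetical df from the sketch above, the Column version and the SQL-string version should return the same rows:

df.filter(col("act_date").between("2016-10-01", "2017-04-01")).show()
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'").show()
# Both keep the 2016-12-01 row; rows on the boundary dates would be kept too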
The first approach is not even remotely valid. In Python, and returns:
- The last element if all expressions are "truthy".
- The first "falsy" element otherwise.
As a result,
"act_date <='2017-04-01'" and "act_date >='2016-10-01'"
evaluates (since any non-empty string is truthy) to:
"act_date >='2016-10-01'"
In the first case
df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
.select("col1","col2").distinct().count()
the result is every value with act_date >= 2016-10-01, which includes the values after 2017-04-01 as well.
Whereas in the second case
df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
.select("col1","col2").distinct().count()
the result is the values between 2016-10-01 and 2017-04-01.
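Chaining filter calls like this is equivalent to combining the conditions with &, so the following (again assuming df, col1 and col2 exist in your session) should give the same count as the second case:

df.filter(
    (col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01")
).select("col1", "col2").distinct().count()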