Multiple condition filter on dataframe

TL;DR To pass multiple conditions to filter or where, use Column objects and logical operators (&, |, ~). See Pyspark: multiple conditions in when clause.

from pyspark.sql.functions import col

df.filter((col("act_date") >= "2016-10-01") & (col("act_date") <= "2017-04-01"))

You can also use a single SQL string:

df.filter("act_date >='2016-10-01' AND act_date <='2017-04-01'")

In practice it makes more sense to use between:

df.filter(col("act_date").between("2016-10-01", "2017-04-01"))
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'")

The first approach is not even remotely valid. In Python, the and operator returns:

  • The last element if all expressions are "truthy".
  • The first "falsey" element otherwise.
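
This is plain Python behavior, nothing Spark-specific; a quick check in the interpreter (with throwaway strings) shows it:

>>> "first" and "second"   # both non-empty, so the last operand is returned
'second'
>>> "" and "second"        # empty string is falsy, so it is returned as-is
''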

As a result

"act_date <='2017-04-01'" and "act_date >='2016-10-01'"

is evaluated to (any non-empty string is truthy):

"act_date >='2016-10-01'"

In the first case

df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

the result is every value on or after 2016-10-01, which means all the values above 2017-04-01 are included as well.

Whereas in the second case

df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'")\
  .select("col1","col2").distinct().count()

the result is only the values between 2016-10-01 and 2017-04-01.
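
To see the difference end to end, here is a minimal sketch on a made-up DataFrame (the dates and the col1/col2 values are invented for illustration; the column names match the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row inside the range, one after it
df = spark.createDataFrame(
    [("2016-12-15", "a", "x"), ("2017-06-01", "b", "y")],
    ["act_date", "col1", "col2"],
)

# Broken: Python's `and` reduces the argument to just "act_date >='2016-10-01'"
df.filter("act_date <='2017-04-01'" and "act_date >='2016-10-01'").count()  # 2

# Correct: chain filters or use BETWEEN so both bounds are applied
df.filter("act_date <='2017-04-01'").filter("act_date >='2016-10-01'").count()  # 1
df.filter("act_date BETWEEN '2016-10-01' AND '2017-04-01'").count()            # 1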