Pyspark: Filter dataframe based on multiple conditions
Is there a faster way (without pyspark.sql.functions)?
df.filter((df.d < 5) &
          ((df.col1 != df.col3) |
           (df.col2 != df.col4) &
           (df.col1 == df.col3)))\
    .show()
Your logical condition is wrong. IIUC, what you want is:
import pyspark.sql.functions as f

df.filter(f.col('d') < 5)\
    .filter(
        (f.col('col1') != f.col('col3')) |
        ((f.col('col2') != f.col('col4')) & (f.col('col1') == f.col('col3')))
    )\
    .show()
I broke the filter() step into two calls for readability, but you could equivalently combine everything into a single filter() call (see the sketch after the output).
Output:
+----+----+----+----+---+
|col1|col2|col3|col4| d|
+----+----+----+----+---+
| A| xx| D| vv| 4|
| A| x| A| xx| 3|
| E| xxx| B| vv| 3|
| F|xxxx| F| vvv| 4|
| G| xxx| G| xx| 4|
+----+----+----+----+---+
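For completeness, a minimal sketch of the same condition written as one filter() call, with explicit parentheses around the & group so the intended grouping is obvious (assuming the same df as above):

import pyspark.sql.functions as f

# Single filter(): & binds tighter than |, but the explicit parentheses
# make the intended grouping clear at a glance.
df.filter(
    (f.col('d') < 5) &
    (
        (f.col('col1') != f.col('col3')) |
        ((f.col('col2') != f.col('col4')) & (f.col('col1') == f.col('col3')))
    )
).show()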
You can also write it like below (without pyspark.sql.functions):
df.filter('d<5 and (col1 <> col3 or (col1 = col3 and col2 <> col4))').show()
Result:
+----+----+----+----+---+
|col1|col2|col3|col4| d|
+----+----+----+----+---+
| A| xx| D| vv| 4|
| A| x| A| xx| 3|
| E| xxx| B| vv| 3|
| F|xxxx| F| vvv| 4|
| G| xxx| G| xx| 4|
+----+----+----+----+---+
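As a side note, where() is an alias for filter(), so the same SQL-string condition can be passed to it as well; this is purely a stylistic variant, assuming the same df:

# where() is an alias for filter(); both accept a SQL expression string.
df.where('d < 5 and (col1 <> col3 or (col1 = col3 and col2 <> col4))').show()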