Comparison operator in PySpark (not equal/ !=)
Why is it not filtering?
Because it is SQL, and NULL indicates a missing value. Because of that, any comparison to NULL other than IS NULL and IS NOT NULL is undefined. You need either:
col("bar").isNull() | (col("bar") != 1)
or
coalesce(col("bar") != 1, lit(True))
or, for a null-safe comparison in PySpark >= 2.3:
col("bar").eqNullSafe(1)
(negate it with ~ to get a null-safe not-equal).
Also, 'null' is not a valid way to introduce a NULL literal. You should use None to indicate missing objects.
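A minimal sketch of the difference, using a throwaway demo DataFrame (the names here are illustrative):
demo = spark.createDataFrame([(1, None), (2, 'null')], ('id', 'value'))
demo.where(demo.value.isNull()).count()  # 1 -- only the row built from None is a real NULL; 'null' is just a string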
from pyspark.sql.functions import col, coalesce, lit

df = spark.createDataFrame([
    ('a', 1, 1), ('a', 1, None), ('b', 1, 1),
    ('c', 1, None), ('d', None, 1), ('e', 1, 1)
]).toDF('id', 'foo', 'bar')
df.where((col("foo") == 1) & (col("bar").isNull() | (col("bar") != 1))).show()
## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## | a| 1|null|
## | c| 1|null|
## +---+---+----+
df.where((col("foo") == 1) & coalesce(col("bar") != 1, lit(True))).show()
## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## | a| 1|null|
## | c| 1|null|
## +---+---+----+
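For completeness, the PySpark >= 2.3 variant should select the same rows, since negating eqNullSafe gives a null-safe not-equal:
df.where((col("foo") == 1) & ~col("bar").eqNullSafe(1)).show()
## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+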
To keep only the rows where bar is NULL, try:
foo_df = df.filter((df.foo == 1) & (df.bar.isNull()))
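On the sample DataFrame above this keeps the same two rows:
foo_df.show()
## +---+---+----+
## | id|foo| bar|
## +---+---+----+
## |  a|  1|null|
## |  c|  1|null|
## +---+---+----+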
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull