Difference between === null and isNull in Spark DataFrame
First and foremost, don't use null in your Scala code unless you really have to for compatibility reasons.
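In practice that usually means modelling optional values with Option instead of raw null. A minimal sketch (the Record class here is just an illustration, not part of the question); Spark's encoders map an Option field to a nullable column, with None becoming NULL:

case class Record(c1: Option[String], c2: String)

val r = Record(None, "c")   // absence is explicit; no null anywhere in user code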
Regarding your question, it is plain SQL. col("c1") === null is interpreted as c1 = NULL and, because NULL marks undefined values, the result is undefined for any value, including NULL itself.
spark.sql("SELECT NULL = NULL").show
+-------------+
|(NULL = NULL)|
+-------------+
| null|
+-------------+
spark.sql("SELECT NULL != NULL").show
+-------------------+
|(NOT (NULL = NULL))|
+-------------------+
| null|
+-------------------+
spark.sql("SELECT TRUE != NULL").show
+------------------------------------+
|(NOT (true = CAST(NULL AS BOOLEAN)))|
+------------------------------------+
| null|
+------------------------------------+
spark.sql("SELECT TRUE = NULL").show
+------------------------------+
|(true = CAST(NULL AS BOOLEAN))|
+------------------------------+
| null|
+------------------------------+
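The same three-valued logic applies in the DataFrame DSL; a minimal sketch, assuming a SparkSession named spark:

import org.apache.spark.sql.functions.lit

spark.range(1)
  .select(
    (lit(null) === lit(null)).as("null_eq_null"),
    (lit(true) === lit(null)).as("true_eq_null"))
  .show()
// both columns come back as null, never true or false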
The only valid methods to check for NULL are:

IS NULL:

spark.sql("SELECT NULL IS NULL").show
+--------------+
|(NULL IS NULL)|
+--------------+
| true|
+--------------+
spark.sql("SELECT TRUE IS NULL").show
+--------------+
|(true IS NULL)|
+--------------+
| false|
+--------------+

IS NOT NULL:

spark.sql("SELECT NULL IS NOT NULL").show
+------------------+
|(NULL IS NOT NULL)|
+------------------+
| false|
+------------------+
spark.sql("SELECT TRUE IS NOT NULL").show
+------------------+
|(true IS NOT NULL)|
+------------------+
| true|
+------------------+
These are implemented in the DataFrame DSL as Column.isNull and Column.isNotNull respectively.
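A minimal DSL sketch of the same checks (assuming a SparkSession named spark with spark.implicits._ in scope):

import org.apache.spark.sql.functions.col

val df = Seq((null: String, "b"), ("a", "c")).toDF("c1", "c2")

df.filter(col("c1").isNull).show()     // keeps only the row where c1 is NULL
df.filter(col("c1").isNotNull).show()  // keeps only the row where c1 is not NULL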
Note:

For NULL-safe comparisons use IS DISTINCT FROM / IS NOT DISTINCT FROM:
spark.sql("SELECT NULL IS NOT DISTINCT FROM NULL").show
+---------------+
|(NULL <=> NULL)|
+---------------+
| true|
+---------------+
spark.sql("SELECT NULL IS NOT DISTINCT FROM TRUE").show
+--------------------------------+
|(CAST(NULL AS BOOLEAN) <=> true)|
+--------------------------------+
| false|
+--------------------------------+
or not(_ <=> _) / <=>:
spark.sql("SELECT NULL AS col1, NULL AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
| true|
+---------------+
spark.sql("SELECT NULL AS col1, TRUE AS col2").select($"col1" <=> $"col2").show
+---------------+
|(col1 <=> col2)|
+---------------+
| false|
+---------------+
in SQL and the DataFrame DSL, respectively.
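For completeness, the negated form mentioned above can be sketched like this (same assumptions as the snippets above):

import org.apache.spark.sql.functions.not

spark.sql("SELECT NULL AS col1, TRUE AS col2")
  .select(not($"col1" <=> $"col2").as("col1_distinct_from_col2"))
  .show()
// true: the values are distinct, and unlike !=, the result is never NULL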
Related:
Including null values in an Apache Spark Join
Usually the best way to shed light on unexpected results in Spark DataFrames is to look at the explain plan. Consider the following example:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object Example extends App {

  val session = SparkSession.builder().master("local[*]").getOrCreate()

  case class Record(c1: String, c2: String)

  val data = List(Record("a", "b"), Record(null, "c"))
  val rdd = session.sparkContext.parallelize(data)

  import session.implicits._
  val df: DataFrame = rdd.toDF

  val filtered = df.filter(col("c1") === null)
  println(filtered.count()) // <-- outputs 0, not what was expected

  val filtered2 = df.filter(col("c1").isNull)
  println(filtered2.count()) // <-- outputs 1, as expected

  filtered.explain(true)
  filtered2.explain(true)
}
The first explain plan shows:

== Physical Plan ==
*Filter (isnotnull(c1#2) && null)
+- Scan ExistingRDD[c1#2,c2#3]
This filter clause looks nonsensical: the && with null ensures the predicate can never resolve to true, so no rows are kept.
The second explain plan looks like:

== Parsed Logical Plan ==
'Filter isnull('c1)
+- LogicalRDD [c1#2, c2#3]

== Physical Plan ==
*Filter isnull(c1#2)
+- Scan ExistingRDD[c1#2,c2#3]
Here the filter is what we expect and want.
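If the goal in the example was to keep rows where c1 is NULL, both isNull and a null-safe comparison against a NULL literal behave as intended; a minimal sketch reusing df from the example above:

df.filter(col("c1").isNull).count()          // 1, as expected
df.filter(col("c1") <=> lit(null)).count()   // 1 as well: <=> treats NULL <=> NULL as true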