Scala DataFrame null check for columns
The problem seems to be operator precedence; try using parentheses:
val new_df = df.filter(($"type_interne" !== "") || ($"type_interne" !== null))
You can also write it like this:
val new_df = df.filter(($"type_interne" !== "") or $"type_interne".isNotNull)
Though Raphael's answer was fully correct when it was written, Spark has evolved since then: the !== operator has been deprecated since version 2.0, and you can use =!= instead, which solves the precedence problem above without parentheses. See the corresponding comments in the source code:
https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L319-L320
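For example, with =!= no extra parentheses are needed. A minimal sketch, reusing the df and type_interne column from the question (note it combines isNotNull with &&, since =!= compared against null yields null rather than true, so the null check has to be explicit):

// no parentheses needed around the comparisons; excludes null and empty
val new_df = df.filter($"type_interne".isNotNull && $"type_interne" =!= "")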
Detailed answer:
I'd also like to note something that was not obvious to me at the beginning.
There are the notions of DataFrame (DF) and Dataset (DS), which also divide their usage in the above context into:
1) expression strings interpreted by Catalyst (errors are caught only at run time) - works for both DF and DS
case class NullStrings(n: Int, s: String)

import spark.implicits._  // for .toDF and the $ syntax below

val df = spark.sparkContext.parallelize(Seq(
  (1, "abc"),
  (2, "ABC"),
  (3, null),
  (4, ""))
).toDF("n", "s")
df.filter("s is not null and s != ''").show()
+---+---+
| n| s|
+---+---+
| 1|abc|
| 2|ABC|
+---+---+
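To see what "caught only at run time" means: a typo in the expression string still compiles; Catalyst only rejects it when the expression is parsed. A small sketch (the misspelled keyword is deliberate):

import scala.util.Try

// "iz" is a deliberate typo: this compiles fine, but Catalyst throws at run time
println(Try(df.filter("s iz not null").show()).isFailure)  // true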
2) DataFrame syntax using the Column notion ($ with the spark.implicits._ import), which is partially compile-time checked:
df.filter($"s" =!= "" || $"s" =!= null).show()
but in fact =!= ignores nulls (see <=> for a null-safe comparison), hence the filter above is equivalent to
df.filter($"s" =!= "").show()
+---+---+
| n| s|
+---+---+
| 1|abc|
| 2|ABC|
+---+---+
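If you want the Column-based filter to actually exclude nulls too, one way (a sketch using the same df) is to combine isNotNull with =!=; <=> is the null-safe equality operator, which returns true or false instead of null:

// excludes both null and empty strings
df.filter($"s".isNotNull && $"s" =!= "").show()

// ($"s" <=> "") evaluates to false (not null) for the null row,
// so negating it drops the empty string but keeps the null row
df.filter(!($"s" <=> "")).show()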
3) Dataset syntax:
val ds = df.as[NullStrings]
ds.filter(r => r.s != null && r.s.nonEmpty).show()
+---+---+
| n| s|
+---+---+
| 1|abc|
| 2|ABC|
+---+---+
Beware that if you use Option in the case class, you have to deal with the Option, not a plain String:
case class NullStringsOption(n: Int, s: Option[String])
val ds1 = df.as[NullStringsOption]
ds1.filter(_.s.exists(_.nonEmpty)).show()
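As a follow-up sketch, you could also unwrap the Option explicitly, e.g. with getOrElse (the placeholder string is just for illustration):

// replaces the missing value with a placeholder instead of filtering it out
ds1.map(r => r.s.getOrElse("<missing>")).show()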