Spark DataFrame Nested Case When Statement
For more complex logic, I prefer to use UDFs for better readability:
import org.apache.spark.sql.functions.{col, udf}

// UDF returning "Y", "N", or null depending on tc and amt
val selectCase = udf((tc: String, amt: String) =>
  if (Seq("a", "b").contains(tc)) "Y"
  else if (tc == "a" && amt.toInt <= 0) "N"
  else null
)

dataset1.withColumn("REASON", selectCase(col("tc"), col("amt")))
  .show
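For reference, here is a minimal self-contained sketch of the UDF approach; the SparkSession setup and the sample rows in dataset1 are assumptions for illustration, not from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical session and sample data, assumed purely to make the example runnable
val spark = SparkSession.builder().master("local[*]").appName("nested-case-when").getOrCreate()
import spark.implicits._

val dataset1 = Seq(("a", "5"), ("b", "-2"), ("c", "0")).toDF("tc", "amt")

val selectCase = udf((tc: String, amt: String) =>
  if (Seq("a", "b").contains(tc)) "Y"
  else if (tc == "a" && amt.toInt <= 0) "N" // never reached: tc == "a" already matched above
  else null
)

dataset1.withColumn("REASON", selectCase(col("tc"), col("amt"))).show()
// +---+---+------+
// | tc|amt|REASON|
// +---+---+------+
// |  a|  5|     Y|
// |  b| -2|     Y|
// |  c|  0|  null|
// +---+---+------+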
There is no nesting here, therefore there is no need for otherwise. All you need is chained when:
import spark.implicits._
import org.apache.spark.sql.functions.when

when($"tc" isin ("a", "b"), "Y")
  .when($"tc" === "a" && $"amt" >= 0, "N")
ELSE NULL is implicit so you can omit it completely.
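Applied to a DataFrame it might look like the sketch below (dataset1 again being the hypothetical frame from the UDF example); rows matching neither predicate come out as NULL without an explicit otherwise:

dataset1
  .withColumn("REASON",
    when($"tc" isin ("a", "b"), "Y")
      .when($"tc" === "a" && $"amt" >= 0, "N"))
  .show()
// +---+---+------+
// | tc|amt|REASON|
// +---+---+------+
// |  a|  5|     Y|
// |  b| -2|     Y|
// |  c|  0|  null|
// +---+---+------+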
The pattern you use is more applicable for folding over a data structure:
val cases = Seq(
  ($"tc" isin ("a", "b"), "Y"),
  ($"tc" === "a" && $"amt" >= 0, "N")
)
where when-otherwise naturally follows the recursion pattern and null provides the base case.
import org.apache.spark.sql.functions.lit

cases.foldLeft(lit(null)) {
  case (acc, (expr, value)) => when(expr, value).otherwise(acc)
}
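One subtlety worth flagging, sketched below: each step of the fold wraps the previous accumulator in otherwise, so the last entry of cases becomes the outermost, first-evaluated branch. Assigning the result to a column might look like this (dataset1 again being the hypothetical frame from above):

// The fold expands, innermost first, to the equivalent of:
//   when($"tc" === "a" && $"amt" >= 0, "N")
//     .otherwise(when($"tc" isin ("a", "b"), "Y")
//       .otherwise(lit(null)))
// so here, unlike in the chained version above, the "N" case is checked first.
val reason = cases.foldLeft(lit(null)) {
  case (acc, (expr, value)) => when(expr, value).otherwise(acc)
}

dataset1.withColumn("REASON", reason).show()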
Please note that it is impossible to reach the "N" outcome with this chain of conditions. If tc is equal to "a", it will be captured by the first clause; if it is not, it will fail to satisfy both predicates and default to NULL. You should rather:
when($"tc" === "a" && $"amt" >= 0, "N")
.when($"tc" isin ("a", "b"), "Y")