how to filter out a null value from spark dataframe
There are two ways to do it: creating filter condition 1) Manually 2) Dynamically.
Sample DataFrame:
val df = spark.createDataFrame(Seq(
(0, "a1", "b1", "c1", "d1"),
(1, "a2", "b2", "c2", "d2"),
(2, "a3", "b3", null, "d3"),
(3, "a4", null, "c4", "d4"),
(4, null, "b5", "c5", "d5")
)).toDF("id", "col1", "col2", "col3", "col4")
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 0| a1| b1| c1| d1|
| 1| a2| b2| c2| d2|
| 2| a3| b3|null| d3|
| 3| a4|null| c4| d4|
| 4|null| b5| c5| d5|
+---+----+----+----+----+
1) Creating filter condition manually i.e. using DataFrame where
or filter
function
df.filter(col("col1").isNotNull && col("col2").isNotNull).show
or
df.where("col1 is not null and col2 is not null").show
Result:
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 0| a1| b1| c1| d1|
| 1| a2| b2| c2| d2|
| 2| a3| b3|null| d3|
+---+----+----+----+----+
2) Creating filter condition dynamically: This is useful when we don't want any column to have null value and there are large number of columns, which is mostly the case.
To create the filter condition manually in these cases will waste a lot of time. In below code we are including all columns dynamically using map
and reduce
function on DataFrame columns:
val filterCond = df.columns.map(x=>col(x).isNotNull).reduce(_ && _)
How filterCond
looks:
filterCond: org.apache.spark.sql.Column = (((((id IS NOT NULL) AND (col1 IS NOT NULL)) AND (col2 IS NOT NULL)) AND (col3 IS NOT NULL)) AND (col4 IS NOT NULL))
Filtering:
val filteredDf = df.filter(filterCond)
Result:
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 0| a1| b1| c1| d1|
| 1| a2| b2| c2| d2|
+---+----+----+----+----+
Or like df.filter($"friend_id".isNotNull)
df.where(df.col("friend_id").isNull)
Let's say you have this data setup (so that results are reproducible):
// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)
// setting up example data
val e1 = Employee("n1", null, "[email protected]", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "[email protected]", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "[email protected]", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "[email protected]", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "[email protected]", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "[email protected]", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "[email protected]", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "[email protected]", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(employees).toDF
Data is:
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|[email protected]|[c1,1,d1]|
| n2| 2|[email protected]|[c1,1,d1]|
| n3| 3|[email protected]|[c1,1,d1]|
| n4| 4|[email protected]|[c2,2,d2]|
| n5|null|[email protected]|[c2,2,d2]|
| n6| 6|[email protected]|[c2,2,d2]|
| n7| 7|[email protected]|[c3,3,d3]|
| n8| 8|[email protected]|[c3,3,d3]|
+----+----+---------+---------+
Now to filter employees with null
ids, you will do --
df.filter("id is null").show
which will correctly show you following:
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|[email protected]|[c1,1,d1]|
| n5|null|[email protected]|[c2,2,d2]|
+----+----+---------+---------+
Coming to the second part of your question, you can replace the null
ids with 0 and other values with 1 with this --
df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show
This results in:
+----+---+---------+---------+
|name| id| email| company|
+----+---+---------+---------+
| n1| 0|[email protected]|[c1,1,d1]|
| n2| 1|[email protected]|[c1,1,d1]|
| n3| 1|[email protected]|[c1,1,d1]|
| n4| 1|[email protected]|[c2,2,d2]|
| n5| 0|[email protected]|[c2,2,d2]|
| n6| 1|[email protected]|[c2,2,d2]|
| n7| 1|[email protected]|[c3,3,d3]|
| n8| 1|[email protected]|[c3,3,d3]|
+----+---+---------+---------+