How to drop rows with nulls in one column in PySpark?
DataFrames are immutable, so applying a filter that keeps only the non-null values creates a new DataFrame that doesn't contain the records with null values in that column.
df = df.filter(df.col_X.isNotNull())
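For context, here is a minimal, self-contained sketch (the column names and sample rows are made up for illustration) showing that the original DataFrame is left untouched until you reassign it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a"), (2, None), (3, "c")],
    ["id", "col_X"],
)

filtered = df.filter(df.col_X.isNotNull())  # new DataFrame without nulls in col_X
filtered.count()  # 2
df.count()        # still 3 -- the original is unchanged until you reassign df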
Use either drop() with subset:
df.na.drop(subset=["col_X"])
or isNotNull():
df.filter(df.col_X.isNotNull())
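If it isn't obvious that these two are equivalent, here is a quick sketch (the sample data is hypothetical) showing that both produce the same result:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, None), (3, "c")], ["id", "col_X"])

df.na.drop(subset=["col_X"]).show()     # keeps rows 1 and 3
df.filter(df.col_X.isNotNull()).show()  # same rows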
If you want to drop any row in which any value is null, use:
df.na.drop()  # same as df.na.drop("any"); the default is "any"
To drop a row only if all of its values are null, use:
df.na.drop("all")
To restrict the null check to a list of columns, use:
df.na.drop("all", subset=["col1", "col2", "col3"])