Cleaning data with dropna in PySpark
This pandas-style call fails in PySpark (dropna takes no inplace argument):

df.dropna(how='all', inplace=True)
df.show()
First thing: dropna creates a new DataFrame (Spark DataFrames are immutable, so there is no inplace), meaning you have to assign the result to a new name. Second, specify subset to control which columns are checked for null values:
df2 = df.dropna(thresh=2, subset=('Age', 'Gender', 'Occupation'))
df2.show()
Output:
+---+-------+------+---+------+----------+
| id| Name|Height|Age|Gender|Occupation|
+---+-------+------+---+------+----------+
| 1| Peter| 1.79| 28| M| Tiler|
| 2| Fritz| 1.78| 45| M| null|
| 4| Nicola| 1.6| 33| F| Dancer|
| 5|Gregory| 1.8| 54| M| Teacher|
| 7| Dagmar| 1.7| 42| F| Nurse|
+---+-------+------+---+------+----------+
Edit: by the way, thresh=2 alone doesn't work, because thresh means "drop rows that have fewer than thresh (here, 2) non-null values". The 3rd row has id, Name and Height, i.e. 3 non-nulls in total, and the 6th row has 4 non-nulls, so neither falls below the threshold and neither gets dropped. You can try thresh=5 instead.
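If you want to reproduce this end to end, here is a minimal sketch. The values in the dropped rows 3 and 6 are assumptions reconstructed from the non-null counts described above, since the original data for those rows isn't shown:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows 3 and 6 are hypothetical: row 3 has 3 non-nulls (id, Name, Height)
# and row 6 has 4 non-nulls, matching the counts in the edit above.
df = spark.createDataFrame(
    [
        (1, 'Peter',   1.79, 28,   'M',  'Tiler'),
        (2, 'Fritz',   1.78, 45,   'M',  None),
        (3, 'Heidi',   1.82, None, None, None),
        (4, 'Nicola',  1.60, 33,   'F',  'Dancer'),
        (5, 'Gregory', 1.80, 54,   'M',  'Teacher'),
        (6, 'Moritz',  1.75, None, 'M',  None),
        (7, 'Dagmar',  1.70, 42,   'F',  'Nurse'),
    ],
    ['id', 'Name', 'Height', 'Age', 'Gender', 'Occupation'],
)

df.dropna(thresh=2).show()  # rows 3 and 6 survive: both have at least 2 non-nulls
df.dropna(thresh=5).show()  # rows 3 and 6 are dropped: both have fewer than 5 non-nulls
df.dropna(thresh=2, subset=('Age', 'Gender', 'Occupation')).show()  # the output above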
If you simply want to drop every row that contains any null value at all, you can try this:
df.dropna(how='any').show()
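For contrast (assuming the same df as in the sketch above): how='any' is the default and drops a row if any column is null, so here it also removes row 2 (its Occupation is null), while how='all' only drops rows in which every column is null.

df.dropna(how='any').show()  # drops rows 2, 3 and 6: each has at least one null
df.dropna(how='all').show()  # drops nothing here: no row is null in every column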