Cleaning data with dropna in PySpark
This pandas-style call fails in PySpark (dropna takes no inplace argument):

df.dropna(how='all', inplace=True)
df.show()
First thing: dropna creates a new DataFrame (Spark DataFrames are immutable, so there is no inplace), meaning you have to assign the result to a new name. Second, specify subset to control which columns are checked for null values:
df2 = df.dropna(thresh=2, subset=('Age', 'Gender', 'Occupation'))
df2.show()
Output:
+---+-------+------+---+------+----------+
| id| Name|Height|Age|Gender|Occupation|
+---+-------+------+---+------+----------+
| 1| Peter| 1.79| 28| M| Tiler|
| 2| Fritz| 1.78| 45| M| null|
| 4| Nicola| 1.6| 33| F| Dancer|
| 5|Gregory| 1.8| 54| M| Teacher|
| 7| Dagmar| 1.7| 42| F| Nurse|
+---+-------+------+---+------+----------+
Edit: by the way, thresh=2 alone doesn't work, because thresh means "drop rows that have fewer than thresh (here, 2) non-null values". The 3rd row has id, Name and Height, i.e. 3 non-nulls in total, and the 6th row has 4 non-nulls, so neither falls below the threshold and neither gets dropped. You can try thresh=5 instead.
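If you want to reproduce this end to end, here is a minimal sketch. The values in the dropped rows 3 and 6 are assumptions reconstructed from the non-null counts described above, since the original data for those rows isn't shown:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rows 3 and 6 are hypothetical: row 3 has 3 non-nulls (id, Name, Height)
# and row 6 has 4 non-nulls, matching the counts in the edit above.
df = spark.createDataFrame(
    [
        (1, 'Peter',   1.79, 28,   'M',  'Tiler'),
        (2, 'Fritz',   1.78, 45,   'M',  None),
        (3, 'Heidi',   1.82, None, None, None),
        (4, 'Nicola',  1.60, 33,   'F',  'Dancer'),
        (5, 'Gregory', 1.80, 54,   'M',  'Teacher'),
        (6, 'Moritz',  1.75, None, 'M',  None),
        (7, 'Dagmar',  1.70, 42,   'F',  'Nurse'),
    ],
    ['id', 'Name', 'Height', 'Age', 'Gender', 'Occupation'],
)

df.dropna(thresh=2).show()  # rows 3 and 6 survive: both have at least 2 non-nulls
df.dropna(thresh=5).show()  # rows 3 and 6 are dropped: both have fewer than 5 non-nulls
df.dropna(thresh=2, subset=('Age', 'Gender', 'Occupation')).show()  # the output above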
If you simply want to drop every row that contains any null value at all, you can try this:
df.dropna(how='any').show()
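For contrast (assuming the same df as in the sketch above): how='any' is the default and drops a row if any column is null, so here it also removes row 2 (its Occupation is null), while how='all' only drops rows in which every column is null.

df.dropna(how='any').show()  # drops rows 2, 3 and 6: each has at least one null
df.dropna(how='all').show()  # drops nothing here: no row is null in every column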