Remove duplicates from a dataframe in PySpark

if you have a data frame and want to remove all duplicates -- with reference to duplicates in a specific column (called 'colName'):

count before dedupe:

df.count()

do the de-dupe (convert the column you are de-duping to string type):

from pyspark.sql.functions import col
df = df.withColumn('colName',col('colName').cast('string'))

df.drop_duplicates(subset=['colName']).count()

can use a sorted groupby to check to see that duplicates have been removed:

df.groupBy('colName').count().toPandas().set_index("count").sort_index(ascending=False)

It is not an import problem. You simply call .dropDuplicates() on a wrong object. While class of sqlContext.createDataFrame(rdd1, ...) is pyspark.sql.dataframe.DataFrame, after you apply .collect() it is a plain Python list, and lists don't provide dropDuplicates method. What you want is something like this:

 (df1 = sqlContext
     .createDataFrame(rdd1, ['column1', 'column2', 'column3', 'column4'])
     .dropDuplicates())

 df1.collect()

Remove duplicates from a dataframe in PySpark

Tags:

Python

Duplicates

Apache Spark

Pyspark

Pyspark Dataframes

Related

Recent Posts