Spark dataframe write method writing many small files

In Python you can rewrite Raphael's Roth answer as:

(df
  .repartition("date")
  .write.mode("append")
  .partitionBy("date")
  .parquet("{path}".format(path=path)))

You might also consider adding more columns to .repartition to avoid problems with very large partitions:

(df
  .repartition("date", another_column, yet_another_colum)
  .write.mode("append")
  .partitionBy("date)
  .parquet("{path}".format(path=path)))

you have to repartiton your DataFrame to match the partitioning of the DataFrameWriter

Try this:

df
.repartition($"date")
.write.mode(SaveMode.Append)
.partitionBy("date")
.parquet(s"$path")

The simplest solution would be to replace your actual partitioning by :

df
 .repartition(to_date($"date"))
 .write.mode(SaveMode.Append)
 .partitionBy("date")
 .parquet(s"$path")

You can also use more precise partitioning for your DataFrame i.e the day and maybe the hour of an hour range. and then you can be less precise for writer. That actually depends on the amount of data.

You can reduce entropy by partitioning DataFrame and the write with partition by clause.

Spark dataframe write method writing many small files

Tags:

Scala

Apache Spark

Related

Recent Posts