Spark df.write.partitionBy runs very slowly
Try adding a repartition("day") before the write, like this:
spark
.sql(sql)
.repartition("day")
.write
.partitionBy("day")
.json(output_path)
This should speed up the write: the shuffle moves all rows for a given day into the same task, so each day=... output directory receives one file instead of one small file from every upstream task.
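If you want to check what the repartition actually does before writing, here is a minimal sketch; df stands in for the result of spark.sql(sql), and the column name "day" is taken from the question:

df = spark.sql(sql)
# Partition count before the shuffle
print(df.rdd.getNumPartitions())
# repartition("day") hash-partitions on the column; the resulting count
# defaults to spark.sql.shuffle.partitions (200 unless configured otherwise)
print(df.repartition("day").rdd.getNumPartitions())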
Alternatively, try repartition(n) with some starting number of partitions, then increase or decrease the number depending on how long the write takes:
spark
.sql(sql)
.repartition(100)  # starting point, not a recommendation; tune up or down based on write time
.write
.partitionBy("day")
.json(output_path)