Overwriting a Spark output using PySpark
On Spark 1.x with the external spark-csv package, try:
spark_df.write.format('com.databricks.spark.csv') \
.mode('overwrite').option("header", "true").save(self.output_file_path)
DataFrameWriter has existed since Spark 1.4, and Spark 2.0 and above have a built-in csv method on it:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter
e.g.
spark_df.write.csv(path=self.output_file_path, header="true", mode="overwrite", sep="\t")
Which is syntactic sugar for
spark_df.write.format("csv").mode("overwrite").options(header="true",sep="\t").save(path=self.output_file_path)
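As a minimal sketch of the overwrite behaviour, assuming PySpark and a Java runtime are installed: `write_tsv` and `demo` are hypothetical helper names used only for illustration, and the output location is a throwaway temp directory rather than anything from the question.

```python
import tempfile


def write_tsv(df, path):
    """Write df as tab-separated CSV with a header, replacing anything at path."""
    # mode="overwrite" removes existing output at `path` before writing;
    # the default mode ("error" / "errorifexists") raises if `path` exists.
    # Spark writes a directory of part files at `path`, not a single file.
    df.write.csv(path=path, header=True, mode="overwrite", sep="\t")


def demo():
    """Hypothetical local demo: write twice to the same existing directory."""
    # Imported here so write_tsv can be reused without starting Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("overwrite-demo")
        .getOrCreate()
    )
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    out_dir = tempfile.mkdtemp()     # already exists, so the default mode would fail
    write_tsv(df, out_dir)           # succeeds because of mode="overwrite"
    write_tsv(df.limit(1), out_dir)  # second write replaces the first

    # Read the result back to see that only the second write remains.
    print(spark.read.csv(out_dir, header=True, sep="\t").count())
    spark.stop()
```

Calling `demo()` writes to the same directory twice without an "already exists" error, which is the behaviour the question is after.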
I think the confusing part is finding where the docs list the options available for each format.
These write-related methods belong to the DataFrameWriter class:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter
The csv method accepts these options, which are also available when using format("csv"):
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.csv
How you supply parameters also depends on whether the method takes a single key and value as two positional arguments or takes keyword arguments. This is standard Python (*args, **kwargs) style; it just differs from the Scala syntax.
For example, the option(key, value) method sets one option from two positional arguments, like option("header", "true"), and the options(**options) method takes any number of keyword assignments, e.g. options(header="true", sep="\t").
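To make the difference concrete, here is a small sketch of the two calling styles applied to the same dictionary of settings. The helper names `apply_one_by_one` and `apply_all_at_once` are made up for illustration; they rely only on the `option` and `options` methods of DataFrameWriter, both of which return the writer so calls can be chained.

```python
def apply_one_by_one(writer, settings):
    """Set each option with option(key, value): two positional arguments."""
    for key, value in settings.items():
        writer = writer.option(key, value)
    return writer


def apply_all_at_once(writer, settings):
    """Set every option in one call by unpacking the dict into keyword args."""
    # With settings = {"header": "true", "sep": "\t"} this is the same as
    # writer.options(header="true", sep="\t").
    return writer.options(**settings)
```

Usage would look like `apply_all_at_once(spark_df.write.format("csv"), {"header": "true", "sep": "\t"}).mode("overwrite").save(path)`, where `path` stands in for your own output location. Keeping the settings in a dict is handy when the same options are shared between a reader and a writer.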
EDIT 2021
The docs have had a huge facelift, which may help new users discover functionality starting from what they need to do, but it takes some getting used to.
DataFrameReader and DataFrameWriter are now listed under Input/Output in the API docs: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#input-and-output
The DataFrameWriter.csv method is now documented here: https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameWriter.csv.html#pyspark.sql.DataFrameWriter.csv