How to save a DataFrame as compressed (gzipped) CSV?

Spark 2.2+

df.write.option("compression","gzip").csv("path")

Spark 2.0

df.write.csv("path", compression="gzip")

Spark 1.6

The spark-csv GitHub page (https://github.com/databricks/spark-csv) documents a codec option:

codec: compression codec to use when saving to file. Should be the fully qualified name of a class implementing org.apache.hadoop.io.compress.CompressionCodec or one of case-insensitive shorten names (bzip2, gzip, lz4, and snappy). Defaults to no compression when a codec is not specified.

In this case, the codec is passed as a writer option:

df.write.format("com.databricks.spark.csv").option("codec", "gzip") \
    .save("my_directory/my_file.gzip")
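The path given to save() or csv() names a directory rather than a single file: Spark writes one compressed part file per partition into it. Below is a minimal sketch for checking such output without Spark, using only the Python standard library. It assumes gzip compression and UTF-8 text, and read_gzipped_csv_dir is a hypothetical helper name, not part of any Spark API.

```python
import csv
import glob
import gzip
import os


def read_gzipped_csv_dir(directory):
    """Return all rows from the gzip-compressed part files in `directory`."""
    rows = []
    # Spark names its output files part-*; the _SUCCESS marker and the
    # dot-prefixed .crc checksum files are not matched by this pattern.
    for part in sorted(glob.glob(os.path.join(directory, "part-*.gz"))):
        # Mode "rt" decompresses and decodes to text in one step.
        with gzip.open(part, "rt", newline="", encoding="utf-8") as handle:
            rows.extend(csv.reader(handle))
    return rows
```

If the data was written with a header, each part file repeats the header line, so the rows returned here would contain it once per file.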

With Spark 2.0+, this has become a bit simpler:

df.write.csv("path", compression="gzip")  # Python-only
df.write.option("compression", "gzip").csv("path") // Scala or Python

You don't need the external Databricks CSV package anymore.

The csv() writer supports a number of handy options. For example:

sep: To set the separator character.
quote: The character used to quote values that contain the separator.
header: Whether to include a header line.
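These options can be combined with compression in a single call. The sketch below assumes df is a PySpark DataFrame on Spark 2.0 or later; write_gzipped_csv is a hypothetical wrapper name chosen for illustration.

```python
def write_gzipped_csv(df, path, sep=",", quote='"', header=True):
    """Write a Spark DataFrame to `path` as gzip-compressed CSV part files."""
    # `df` is assumed to be a pyspark.sql.DataFrame (Spark 2.0 or later).
    # compression, sep, quote and header are keyword arguments of
    # DataFrameWriter.csv() in PySpark.
    df.write.csv(path, compression="gzip", sep=sep, quote=quote, header=header)
```

For example, write_gzipped_csv(df, "path", sep="\t") would write tab-separated, gzipped output with a header line.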

There are also a number of other compression codecs you can use, in addition to gzip:

bzip2
lz4
snappy
deflate
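Each short name stands for a Hadoop CompressionCodec class, which is what the fully qualified form of the option expects. The lookup below sketches that correspondence as I understand it. The table and the codec_class helper are illustrative assumptions, so check the class names against your Spark and Hadoop versions.

```python
# Short codec names and the Hadoop classes they are assumed to stand for.
HADOOP_CODECS = {
    "bzip2": "org.apache.hadoop.io.compress.BZip2Codec",
    "deflate": "org.apache.hadoop.io.compress.DeflateCodec",
    "gzip": "org.apache.hadoop.io.compress.GzipCodec",
    "lz4": "org.apache.hadoop.io.compress.Lz4Codec",
    "snappy": "org.apache.hadoop.io.compress.SnappyCodec",
}


def codec_class(name):
    """Resolve a short, case-insensitive codec name to a class name."""
    # Anything not in the table is assumed to already be a fully
    # qualified class name and is returned unchanged.
    return HADOOP_CODECS.get(name.lower(), name)
```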

The full list of options is in the Spark API docs for DataFrameWriter.csv() (Python and Scala).

This code works for Spark 2.1, where the writer has no .codec method.

df.write
  .format("com.databricks.spark.csv")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save(my_directory)

For Spark 2.2, you can use the compression keyword argument, df.write.csv(..., compression="gzip"), described here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=codec
