How to export a table dataframe in PySpark to csv?
If you cannot use spark-csv, you can do the following:
df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("file.csv")
If you need to handle strings with embedded line breaks or commas, that will not work. Use this instead:
import csv
import cStringIO

def row2csv(row):
    # csv.writer takes care of quoting fields that contain commas or line breaks
    buffer = cStringIO.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([unicode(s).encode("utf-8") for s in row])
    buffer.seek(0)
    return buffer.read().strip()
df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")
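On Python 3 (Spark 2+), cStringIO no longer exists; a minimal equivalent sketch using io.StringIO (strings are already Unicode there, so no explicit encoding is needed):

import csv
import io

def row2csv(row):
    # csv.writer quotes fields with embedded commas or line breaks
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow([str(s) for s in row])
    return buffer.getvalue().strip()

df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")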
If the data frame fits in driver memory and you want to save to the local file system, you can convert the Spark DataFrame to a local Pandas DataFrame using the toPandas method and then simply call to_csv:
df.toPandas().to_csv('mycsv.csv')
Otherwise you can use spark-csv:
Spark 1.3
df.save('mycsv.csv', 'com.databricks.spark.csv')
Spark 1.4+
df.write.format('com.databricks.spark.csv').save('mycsv.csv')
In Spark 2.0+ you can use the csv data source directly:
df.write.csv('mycsv.csv')
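If you want a header row or need to overwrite an existing output directory, the built-in csv writer accepts the usual DataFrameWriter options; a minimal sketch:

# header writes column names as the first row; mode='overwrite' replaces existing output
df.write.csv('mycsv.csv', header=True, mode='overwrite')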
For Apache Spark 2+, to save a data frame to a single CSV file, use the following command:
query.repartition(1).write.csv("cc_out.csv", sep='|')
Here 1 indicates that only one partition of CSV output is needed; you can change it according to your requirements. Note that the path names a directory: Spark writes the single part file inside it.
You need to repartition the DataFrame into a single partition, then define the format, path and other parameters for the file in Unix file-system format, and here you go:
df.repartition(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')
Read more about the repartition and save functions in the Spark documentation.
However, repartition is a costly function, and toPandas() is the worst of all. Try using .coalesce(1) instead of .repartition(1) in the previous syntax for better performance, as sketched below. Read more on the difference between the repartition and coalesce functions.
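For example, the earlier save command rewritten with coalesce (same hypothetical path and options as above):

# coalesce(1) merges partitions without a full shuffle, unlike repartition(1)
df.coalesce(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header='true')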