How to copy and convert Parquet files to CSV
If there is a table defined over those Parquet files in Hive (or if you define such a table yourself), you can run a Hive query against it and save the results to a CSV file. Try something along the lines of:
insert overwrite local directory 'dirname' row format delimited fields terminated by ',' select * from tablename;
Substitute dirname and tablename with actual values. Be aware that any existing content in the specified directory gets deleted. See Writing data into the filesystem from queries for details.
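For instance, with a hypothetical Hive table my_parquet_table and a hypothetical export directory /tmp/csv_export, the statement would look like:
insert overwrite local directory '/tmp/csv_export' row format delimited fields terminated by ',' select * from my_parquet_table;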
Try
df = spark.read.parquet("/path/to/infile.parquet")
df.write.csv("/path/to/outfile.csv")
Relevant API documentation:
- pyspark.sql.DataFrameReader.parquet
- pyspark.sql.DataFrameWriter.csv
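If you also want a header row or need to overwrite an existing output path, DataFrameWriter.csv accepts options for both; a minimal sketch (the paths are placeholders):
# read the Parquet data and write it as CSV with a header, replacing any previous output
df = spark.read.parquet("/path/to/infile.parquet")
df.write.csv("/path/to/outfile.csv", header=True, mode="overwrite")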
Both /path/to/infile.parquet and /path/to/outfile.csv should be locations on the HDFS filesystem. You can specify hdfs://... explicitly, or you can omit it, as it is usually the default scheme.
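For example, assuming hdfs://namenode:8020 is the cluster's default filesystem (host and port here are placeholders), both of the following refer to the same location:
df = spark.read.parquet("hdfs://namenode:8020/path/to/infile.parquet")  # explicit scheme
df = spark.read.parquet("/path/to/infile.parquet")                      # relies on the default scheme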
You should avoid using file://..., because a local file means a different file to every machine in the cluster. Output to HDFS instead, then transfer the results to your local disk using the command line:
hdfs dfs -get /path/to/outfile.csv /path/to/localfile.csv
Or display it directly from HDFS:
hdfs dfs -cat /path/to/outfile.csv
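Note that Spark writes the CSV output as a directory containing one or more part files rather than a single file, so you may need a glob for -cat, or -getmerge to combine the parts into one local file (same placeholder paths as above):
hdfs dfs -cat /path/to/outfile.csv/part-*
hdfs dfs -getmerge /path/to/outfile.csv /path/to/localfile.csv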