How to copy and convert Parquet files to CSV
If there is a table defined over those Parquet files in Hive (or if you define such a table yourself), you can run a Hive query against it and save the results to a CSV file. Try something along the lines of:
insert overwrite local directory 'dirname' row format delimited fields terminated by ',' select * from tablename;
Substitute dirname and tablename with actual values. Be aware that any existing content in the specified directory gets deleted. See Writing data into the filesystem from queries for details.
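For instance, with a hypothetical Hive table my_parquet_table and a hypothetical export directory /tmp/csv_export, the statement would look like:
insert overwrite local directory '/tmp/csv_export' row format delimited fields terminated by ',' select * from my_parquet_table;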
Try
df = spark.read.parquet("/path/to/infile.parquet")
df.write.csv("/path/to/outfile.csv")
Relevant API documentation:
- pyspark.sql.DataFrameReader.parquet
- pyspark.sql.DataFrameWriter.csv
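If you also want a header row or need to overwrite an existing output path, DataFrameWriter.csv accepts options for both; a minimal sketch (the paths are placeholders):
# read the Parquet data and write it as CSV with a header, replacing any previous output
df = spark.read.parquet("/path/to/infile.parquet")
df.write.csv("/path/to/outfile.csv", header=True, mode="overwrite")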
Both /path/to/infile.parquet and /path/to/outfile.csv should be locations on the HDFS filesystem. You can specify hdfs://... explicitly, or you can omit it, as it is usually the default scheme.
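For example, assuming hdfs://namenode:8020 is the cluster's default filesystem (host and port here are placeholders), both of the following refer to the same location:
df = spark.read.parquet("hdfs://namenode:8020/path/to/infile.parquet")  # explicit scheme
df = spark.read.parquet("/path/to/infile.parquet")                      # relies on the default scheme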
You should avoid using file://..., because a local file means a different file to every machine in the cluster. Output to HDFS instead, then transfer the results to your local disk using the command line:
hdfs dfs -get /path/to/outfile.csv /path/to/localfile.csv
Or display it directly from HDFS:
hdfs dfs -cat /path/to/outfile.csv
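Note that Spark writes the CSV output as a directory containing one or more part files rather than a single file, so you may need a glob for -cat, or -getmerge to combine the parts into one local file (same placeholder paths as above):
hdfs dfs -cat /path/to/outfile.csv/part-*
hdfs dfs -getmerge /path/to/outfile.csv /path/to/localfile.csv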