Read all Parquet files saved in a folder via Spark
You can write the data into a folder not as separate Spark "files" (which are in fact folders) like 1.parquet, 2.parquet, etc. If you don't specify a file name but only a path, Spark will put files into the folder as real files (not folders) and name them automatically.
df1.write.partitionBy("countryCode").format("parquet").mode("overwrite").save("/tmp/data1/")
df2.write.partitionBy("countryCode").format("parquet").mode("append").save("/tmp/data1/")
df3.write.partitionBy("countryCode").format("parquet").mode("append").save("/tmp/data1/")
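For example, you can check what actually got written under /tmp/data1/ with the Hadoop FileSystem API; a minimal sketch (the exact part-file names will differ on your machine):
// a minimal sketch: list what Spark actually wrote under /tmp/data1/
// (one countryCode=... subdirectory per partition value, each holding
// automatically named part files)
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/tmp/data1/")).foreach(status => println(status.getPath))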
Then we can read the data from all files in the folder:
val df = spark.read.format("parquet").load("/tmp/data1/")
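And since the data was partitioned by countryCode on write, filtering on that column lets Spark read only the matching partition directories; a minimal sketch (the value "US" is just a hypothetical example):
// a minimal sketch: the data was partitioned by countryCode on write, so
// filtering on that column lets Spark prune to the matching partition
// directories; the value "US" is only a hypothetical example
import org.apache.spark.sql.functions.col

val usOnly = spark.read.format("parquet").load("/tmp/data1/")
  .filter(col("countryCode") === "US")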
Spark doesn't write/read parquet the way you think it does.
It uses the Hadoop library to write/read partitioned Parquet files.
Thus your first Parquet "file" is under the path /tmp/test/df/1.parquet/, where 1.parquet is a directory. This means that when reading Parquet data you need to provide the path to your Parquet directory, or the path to the file if it's a single file.
val df = spark.read.parquet("/tmp/test/df/1.parquet/")
I advise you to read the official documentation for more details. [cf. SQL Programming Guide - Parquet Files]
EDIT:
You must be looking for something like this:
scala> sqlContext.range(1,100).write.save("/tmp/test/df/1.parquet")
scala> sqlContext.range(100,500).write.save("/tmp/test/df/2.parquet")
scala> val df = sqlContext.read.load("/tmp/test/df/*")
// df: org.apache.spark.sql.DataFrame = [id: bigint]
scala> df.show(3)
// +---+
// | id|
// +---+
// |400|
// |401|
// |402|
// +---+
// only showing top 3 rows
scala> df.count
// res3: Long = 499
You can also use wildcards in your file path URIs.
And you can provide multiple file paths, as follows:
scala> val df2 = sqlContext.read.load("/tmp/test/df/1.parquet","/tmp/test/df/2.parquet")
// df2: org.apache.spark.sql.DataFrame = [id: bigint]
scala> df2.count
// res5: Long = 499
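As a sketch of the same idea, Hadoop-style globbing also supports brace alternation, so you could pick out just those two directories with a single pattern:
scala> val dfGlob = sqlContext.read.load("/tmp/test/df/{1,2}.parquet")
// the {1,2} brace pattern is expanded by Hadoop's glob matching, so both
// the 1.parquet and 2.parquet directories are read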
The paths you wrote to, /tmp/test/df/1.parquet and /tmp/test/df/2.parquet, are not output files; they are output directories.
So you can read the Parquet data like this:
val data = spark.read.parquet("/tmp/test/df/1.parquet/")
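If you're on Spark 3.x, you can also read everything under the parent folder in one go; a sketch using the recursiveFileLookup reader option (assuming both output directories sit directly under /tmp/test/df):
// a sketch, assuming Spark 3.x: recursiveFileLookup makes Spark pick up every
// Parquet part file under /tmp/test/df, including the 1.parquet/ and
// 2.parquet/ output directories
val all = spark.read
  .option("recursiveFileLookup", "true")
  .parquet("/tmp/test/df")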