How can I efficiently read multiple json files into a Dataframe or JavaRDD?
You can use exactly the same code to read multiple JSON files. Just pass a path-to-a-directory / path-with-wildcards instead of path to a single file.
DataFrameReader
also provides json
method with a following signature:
json(jsonRDD: JavaRDD[String])
which can be used to parse JSON already loaded into JavaRDD
.
To read multiple inputs in Spark, use wildcards. That's going to be true whether you're constructing a dataframe or an rdd.
context.read().json("/home/spark/articles/*.json")
// or getting json out of s3
context.read().json("s3n://bucket/articles/201510*/*.json")