How to recursively read Hadoop files from directory using Spark?

If you are using Spark, you can do this with wildcards, as follows (each `*` matches one path component):

scala> sc.textFile("path/*/*")

sc is the SparkContext. If you are using spark-shell, it is initialized by default; if you are writing your own program, you will have to instantiate a SparkContext yourself.
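For a standalone program, a minimal sketch of creating the SparkContext yourself might look like the following. The object name `RecursiveRead`, the helper `countLines`, the application name, and the `local[*]` master are assumptions made for illustration; `path/*/*` is the same placeholder pattern as above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RecursiveRead {
  // Hypothetical helper: counts the lines of every file matched by the glob pattern.
  def countLines(sc: SparkContext, pattern: String): Long =
    sc.textFile(pattern).count()

  def main(args: Array[String]): Unit = {
    // Assumption: run locally on all cores; on a cluster the master is
    // usually supplied by spark-submit instead of being hard-coded.
    val conf = new SparkConf().setAppName("RecursiveRead").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try println(countLines(sc, "path/*/*"))
    finally sc.stop() // release the context even if the read fails
  }
}
```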

Be careful with the following flag:

scala> sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.input.dir.recursive")
res6: String = null

You should set this flag to true:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive","true")

I have found that the parameters must be set this way, on the SparkConf and with the `spark.` prefix:

.set("spark.hive.mapred.supports.subdirectories","true")
.set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive","true")
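To put those two settings in context, here is a hedged sketch of a complete SparkConf built with them. The object name `RecursiveConf`, the application name, and the `local[*]` master are assumptions for illustration. As far as I know, Spark copies `spark.hadoop.*` entries into the Hadoop configuration with the `spark.hadoop.` prefix stripped, so the final `println` should show the recursive flag as set.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RecursiveConf {
  // The two properties from the answer above, set on the SparkConf
  // before the SparkContext is created.
  val conf: SparkConf = new SparkConf()
    .setAppName("RecursiveConf") // hypothetical application name
    .setMaster("local[*]")       // assumption: local run
    .set("spark.hive.mapred.supports.subdirectories", "true")
    .set("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(conf)
    try {
      // Check what the Hadoop configuration actually received.
      println(sc.hadoopConfiguration.get(
        "mapreduce.input.fileinputformat.input.dir.recursive"))
    } finally sc.stop()
  }
}
```

Setting the values on the SparkConf means they are in place before any RDD is created, which avoids having to remember the `sc.hadoopConfiguration.set` call in every job.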

