Read all files in a nested folder in Spark

If you want to use only files whose names start with "a", you can use

sc.wholeTextFiles("/folder/a*/*/*.txt") or sc.wholeTextFiles("/folder/a*/a*/*.txt")

as well. The * character acts as a wildcard.
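
For example, a quick way to check which files such a pattern matches before processing them (a minimal PySpark sketch, assuming a SparkContext named sc):

# Collect only the matched file paths; x[0] is the path in each (path, content) pair
matched = sc.wholeTextFiles("/folder/a*/*/*.txt").map(lambda x: x[0]).collect()
print(matched)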


If the directory structure is regular, let's say something like this:

folder
├── a
│   ├── a
│   │   └── aa.txt
│   └── b
│       └── ab.txt
└── b
    ├── a
    │   └── ba.txt
    └── b
        └── bb.txt

you can use the * wildcard for each level of nesting, as shown below:

>>> sc.wholeTextFiles("/folder/*/*/*.txt").map(lambda x: x[0]).collect()

[u'file:/folder/a/a/aa.txt',
 u'file:/folder/a/b/ab.txt',
 u'file:/folder/b/a/ba.txt',
 u'file:/folder/b/b/bb.txt']
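
Note that each record returned by wholeTextFiles is a (path, content) pair, so the file contents are available as the second element (a small sketch under the same layout):

>>> pairs = sc.wholeTextFiles("/folder/*/*/*.txt")
>>> pairs.map(lambda x: x[1]).collect()   # x[1] is the file content, x[0] the path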

Spark 3.0 provides the recursiveFileLookup option to load files recursively from nested subfolders.

val df = sparkSession.read
  .option("recursiveFileLookup", "true")
  .option("header", "true")
  .csv("src/main/resources/nested")

This recursively loads the files from src/main/resources/nested and its subfolders.
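
A rough PySpark equivalent, assuming a SparkSession named spark, would be:

# Spark 3.0+: recursively read CSV files under a base path
df = spark.read \
    .option("recursiveFileLookup", "true") \
    .option("header", "true") \
    .csv("src/main/resources/nested")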


Note that sc.wholeTextFiles("/directory/201910*/part-*.lzo") returns the matching files as (file name, content) pairs, not a plain RDD of file contents.

If you want to load just the contents of all matched files in a directory, you should use

sc.textFile("/directory/201910*/part-*.lzo")

and enable recursive directory reading:

sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
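
Putting both together in PySpark (a sketch, assuming a SparkContext named sc and that the LZO codec is available on the cluster):

# Allow input paths to contain nested directories, then read all matched files line by line
sc._jsc.hadoopConfiguration().set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
lines = sc.textFile("/directory/201910*/part-*.lzo")
print(lines.count())   # total number of lines across all matched files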

TIP: the Scala API differs from Python; in Scala, use the following setting instead:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
