How to read multiple text files into a single RDD?
You can specify whole directories, use wildcards and even CSV of directories and wildcards. E.g.:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
As Nick Chammas points out this is an exposure of Hadoop's FileInputFormat
and therefore this also works with Hadoop (and Scalding).
Use union
as follows:
val sc = new SparkContext(...)
val r1 = sc.textFile("xxx1")
val r2 = sc.textFile("xxx2")
...
val rdds = Seq(r1, r2, ...)
val bigRdd = sc.union(rdds)
Then the bigRdd
is the RDD with all files.
You can use a single textFile call to read multiple files. Scala:
sc.textFile(','.join(files))