Spark textFile vs wholeTextFiles
The main difference, as you mentioned, is that textFile returns an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the file path as the key and the file content as the value. If there is no need to separate the data by file, simply use textFile.
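To make the difference in record shape concrete, here is a small plain-Python model. It is not Spark and does not reproduce Spark's partitioning; it only imitates what the elements of each RDD look like. The names model_text_file, model_whole_text_files and demo are made up for this illustration. The PySpark calls they stand in for are sc.textFile(path) and sc.wholeTextFiles(path), and in real Spark the keys are typically full URIs (file:/..., hdfs://...) rather than plain paths.

```python
import os
import tempfile


def model_text_file(directory):
    """Imitate the record shape of sc.textFile(directory): one element per line."""
    records = []
    for name in sorted(os.listdir(directory)):
        with open(os.path.join(directory, name)) as handle:
            records.extend(handle.read().splitlines())
    return records


def model_whole_text_files(directory):
    """Imitate the record shape of sc.wholeTextFiles(directory): one (path, content) pair per file."""
    records = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        with open(path) as handle:
            records.append((path, handle.read()))
    return records


def demo():
    """Write two small files to a temporary directory and read them both ways."""
    with tempfile.TemporaryDirectory() as directory:
        for name, body in [("a.txt", "a1\na2\n"), ("b.txt", "b1\n")]:
            with open(os.path.join(directory, name), "w") as handle:
                handle.write(body)
        return model_text_file(directory), model_whole_text_files(directory)
```

Calling demo() gives a flat list of three lines for the first model and two (path, content) pairs for the second, which is the distinction that matters when choosing between the two methods.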
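When reading uncompressed files with textFile, Spark splits the data into chunks (32 MB by default on a local file system; other file systems such as HDFS use their own block size). This is advantageous from a memory perspective. It also means that the lines of one file can end up in different partitions that are processed independently; if each file has to be processed as a whole, with its lines in their original order, wholeTextFiles should be used.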
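As a rough sketch of how the chunking works out for a single uncompressed file, the function below models the split computation as I understand it from Hadoop's old-API FileInputFormat, which textFile relies on. Treat it as a simplified model under assumptions, not as Spark's code: estimate_splits and its parameter names are hypothetical, the 32 MB default is the figure mentioned above, the default of 2 for min_partitions assumes a default parallelism of at least 2, and the 1.1 slack factor is my recollection of Hadoop's behaviour. Real counts depend on the file system, configuration and version.

```python
MB = 1024 * 1024


def estimate_splits(file_size, block_size=32 * MB, min_partitions=2, min_split_size=1):
    """Estimate the number of input splits for one uncompressed file of file_size > 0 bytes."""
    # Target size if the file were divided evenly across the requested partitions.
    goal_size = file_size // max(min_partitions, 1)
    # A split is never larger than a block and never smaller than the configured minimum.
    split_size = max(min_split_size, min(goal_size, block_size))

    splits = 0
    remaining = file_size
    # Assumed slack: a tail of up to 10% of a split is folded into the last split.
    while remaining / split_size > 1.1:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1
    return splits


# Example: a 100 MB file with 32 MB blocks.
example = estimate_splits(100 * MB)  # 4 under this model
```

Under this model a 10 MB file still yields 2 splits, because the requested minimum of 2 partitions pushes the split size below the block size.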
wholeTextFiles will read the complete content of a file at once; it won't be partially spilled to disk or partially garbage collected. Each file will be handled by one core, and the data for each file will be on a single machine, making it harder to distribute the load.
"textFile generating partition for each file, while wholeTextFiles generates an RDD of pair values"
That's not accurate:
textFile loads one or more files, with each line as a record in the resulting RDD. A single file might be split into several partitions if the file is large enough (this depends on the number of partitions requested, Spark's default number of partitions, and the underlying file system). When loading multiple files at once, this operation "loses" the relation between a record and the file that contained it, i.e. there's no way to know which file contained which line. The order of the records in the RDD follows the alphabetical order of the files and the order of the records within each file (order is not "lost").
wholeTextFiles preserves the relation between data and the files that contained it by loading the data into a PairRDD with one record per input file. Each record has the form (fileName, fileContent). This means that loading large files is risky: it might cause bad performance or an OutOfMemoryError, since each file will necessarily be stored on a single node. Partitioning is done based on user input or Spark's configuration, with multiple files potentially loaded into a single partition.
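If you need line-level records but still want to know which file each line came from, one option is to expand every (fileName, fileContent) pair yourself. The sketch below shows a helper that could be passed to flatMap on the RDD returned by wholeTextFiles. The helper name lines_with_source and the path in the usage comment are hypothetical, and adding a line number is my own choice for the illustration.

```python
def lines_with_source(record):
    """Expand one (fileName, fileContent) pair into (fileName, lineNumber, line) tuples."""
    file_name, content = record
    return [
        (file_name, number, line)
        for number, line in enumerate(content.splitlines(), start=1)
    ]


# Intended use with PySpark, assuming `sc` is an existing SparkContext
# and the path is a placeholder for your own input location:
#   tagged = sc.wholeTextFiles("hdfs:///path/to/small-files").flatMap(lines_with_source)
```

Since each file's content is held as one string before it is expanded, the memory caveat above still applies; this only helps when the files are small.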
Generally speaking, textFile serves the common use case of just loading a lot of data (regardless of how it's broken down into files). wholeTextFiles should only be used if you actually need to know the originating file name of each record, and if you know all files are small enough.
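One way to check the "small enough" condition before committing to wholeTextFiles is to look at the file sizes first. The sketch below does that for a directory on the local file system only; for HDFS or object stores you would need that system's own listing tools. The function name oversized_files and the 64 MB default threshold are assumptions for the example, not a Spark recommendation, so pick a limit that fits your executors' memory.

```python
import os


def oversized_files(directory, max_bytes=64 * 1024 * 1024):
    """Return the names of regular files in `directory` that are larger than max_bytes."""
    too_big = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and os.path.getsize(path) > max_bytes:
            too_big.append(name)
    return too_big
```

An empty result suggests every file fits under the chosen limit; any name that is returned is a candidate for reading with textFile instead.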