Spark textFile vs wholeTextFiles

The main difference, as you mentioned, is that textFile will return an RDD with each line as an element while wholeTextFiles returns a PairRDD with the key being the file path. If there is no need to separate the data depending on the file, simply use textFile.

When reading uncompressed files with textFile, it will split the data into chuncks of 32MB. This is advantagous from a memory perspective. This also means that the ordering of the lines is lost, if the order should be preserved then wholeTextFiles should be used.

wholeTextFiles will read the complete content of a file at once, it won't be partially spilled to disk or partially garbage collected. Each file will be handled by one core and the data for each file will be one a single machine making it harder to distribute the load.

textFile generating partition for each file, while wholeTextFiles generates an RDD of pair values

That's not accurate:

textFile loads one or more files, with each line as a record in the resulting RDD. A single file might be split into several partitions if the file is large enough (depends on the number of partitions requested, Spark's default number of partitions, and the underlying File System). When loading multiple files at once, this operation "loses" the relation between a record and the file that contained it - i.e. there's no way to know which file contained which line. The order of the records in the RDD will follow the alphabetical order of files, and the order of records within the files (order is not "lost").
wholeTextFiles preserves the relation between data and the files that contained it, by loading the data into a PairRDD with one record per input file. The record will have the form (fileName, fileContent). This means that loading large files is risky (might cause bad performance or OutOfMemoryError since each file will necessarily be stored on a single node). Partitioning is done based on user input or Spark's configuration - with multiple files potentially loaded into a single partition.

Generally speaking, textFile serves the common use case of just loading a lot of data (regardless of how it's broken-down into files). readWholeFiles should only be used if you actually need to know the originating file name of each record, and if you know all files are small enough.

Spark textFile vs wholeTextFiles

Tags:

File Io

Scala

Apache Spark

Related

Recent Posts