Cloud Dataflow: reading entire text files rather than line by line
I am going to give the most generally useful answer, even though there are special cases [1] where you might do something different.
I think what you want to do is define a new subclass of `FileBasedSource` and use `Read.from(<source>)`. Your source will also include a subclass of `FileBasedReader`; the source contains the configuration data and the reader actually does the reading.
I think a full description of the API is best left to the Javadoc, but I will highlight the key override points and how they relate to your needs:
- `FileBasedSource#isSplittable()`: you will want to override this to return `false`. This indicates that there is no intra-file splitting.
- `FileBasedSource#createForSubrangeOfFile(String, long, long)`: you will override this to return a sub-source for just the specified file.
- `FileBasedSource#createSingleFileReader()`: you will override this to produce a `FileBasedReader` for the current file (the method may assume the source is already split to the level of a single file).
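Putting those three overrides together, a source subclass might look roughly like the skeleton below. This is a sketch only: the class name `WholeFileSource`, the constructor arguments, and the elided bodies are my assumptions, not verified against the SDK, so check the Javadoc for the exact signatures.

```java
// Skeleton only -- assumes the Dataflow SDK's FileBasedSource; constructor
// arguments and omitted details should be checked against the Javadoc.
class WholeFileSource extends FileBasedSource<String> {

  public WholeFileSource(String fileOrPattern) {
    super(fileOrPattern, /* minBundleSize; irrelevant, since we never split */ 1);
  }

  @Override
  protected boolean isSplittable() {
    // One file becomes one element; no intra-file splitting.
    return false;
  }

  @Override
  protected FileBasedSource<String> createForSubrangeOfFile(
      String fileName, long start, long end) {
    // Return a sub-source scoped to just this one file.
    ...
  }

  @Override
  protected FileBasedReader<String> createSingleFileReader(
      PipelineOptions options) {
    // May assume we are already down to a single file; return the
    // corresponding FileBasedReader subclass for it.
    ...
  }
}
```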
To implement the reader:
- `FileBasedReader#startReading(...)`: you will override this to do nothing; the framework will already have opened the file for you, and it will close it.
- `FileBasedReader#readNextRecord()`: you will override this to read the entire file as a single element.
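The heart of `readNextRecord()` is draining the already-open channel into a single string (and returning `false` on subsequent calls). That logic needs nothing from the SDK itself; here is a minimal stdlib sketch of it, where the helper name `readWholeChannel` is mine, not part of any API:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.StandardCharsets;

public class WholeFileRead {
  // Drain an already-open channel into one String -- the work that a
  // whole-file readNextRecord() would do on its first (and only) call.
  static String readWholeChannel(ReadableByteChannel channel) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    ByteBuffer buf = ByteBuffer.allocate(8192);
    while (channel.read(buf) != -1) {
      buf.flip();
      out.write(buf.array(), 0, buf.limit());
      buf.clear();
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    ReadableByteChannel ch = Channels.newChannel(
        new ByteArrayInputStream("line 1\nline 2\n".getBytes(StandardCharsets.UTF_8)));
    // Both lines come back as one element, newlines included.
    System.out.println(readWholeChannel(ch).length()); // prints 14
  }
}
```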
[1] One easy special case is when you have a small number of files, you can expand the glob prior to job submission, and the files all take about the same amount of time to process. Then you can just use `Create.of(expand(<glob>))` followed by `ParDo(<read a file>)`.
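For that special case, the pre-submission expansion can be done with plain `java.nio.file` when the files are local; the helper name `expandLocalGlob` is my own, and a pipeline reading from GCS would need the SDK's own file utilities instead:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ExpandGlob {
  // Expand a glob such as "*.txt" inside a local directory into a sorted
  // list of concrete file paths, suitable for passing to Create.of(...).
  static List<String> expandLocalGlob(Path dir, String glob) throws IOException {
    List<String> files = new ArrayList<>();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
      for (Path p : stream) {
        files.add(p.toString());
      }
    }
    Collections.sort(files); // directory order is unspecified; sort for determinism
    return files;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("glob-demo");
    Files.write(dir.resolve("a.txt"), "alpha".getBytes());
    Files.write(dir.resolve("b.txt"), "beta".getBytes());
    Files.write(dir.resolve("c.csv"), "gamma".getBytes());
    System.out.println(expandLocalGlob(dir, "*.txt").size()); // prints 2
  }
}
```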