What's the fastest way to read a simple numerical list?
Your real problem is
I have a large .txt file
In a text
file, each number is represented by a sequence of bytes of possibly varying length, followed by a newline. When you read it back in, the bytes representing the digits, the decimal point and the newline need to be converted back into real numbers. This is time-consuming.
In a binary file, on the other hand, each real number occupies a fixed number of bytes, and no separator such as a newline is needed between them. For instance, a "Real32"
takes only 4 bytes, while the same number, e.g. 0.827736082,
needs the following bytes in your text file
ToCharacterCode["0.827736082"]
(* {48, 46, 56, 50, 55, 55, 51, 54, 48, 56, 50} *)
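Going the other way makes the cost concrete: to reconstruct the number, the byte sequence first has to be assembled into a string and then parsed, e.g.

```
bytes = {48, 46, 56, 50, 55, 55, 51, 54, 48, 56, 50};
FromCharacterCode[bytes]
(* "0.827736082" *)
ToExpression[FromCharacterCode[bytes]]
(* 0.827736082 *)
```

This string-assembly and parsing step is exactly what a binary read avoids.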
Therefore, to turn this back into a number, you need to re-convert each byte into a digit and rebuild a real number from them. To give you an idea of how time-consuming this is, let us create a sample list with 10^6 entries and export it both to a text file like yours and to a binary file
list = RandomReal[1, 10^6];
asciFile = "~/tmp/ascii.txt";
binFile = "~/tmp/binary.data";
Export[asciFile, list, "Table"];
st = OpenWrite[binFile, BinaryFormat -> True];
BinaryWrite[st, list, "Real32"];
Close[st];
The first thing you should note is the vast difference in file size
FileByteCount /@ {asciFile, binFile}
(* {19270172, 4000000} *)
and after the paragraphs above, it is clear where this difference comes from. A number in list
looks like 0.21946835530269415,
which takes 19 bytes as text (some entries are shorter), plus one byte for the newline. A Real32, in contrast, always uses exactly 4 bytes.
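You can check the per-number cost directly from the two files created above; the first value varies slightly with the random draw, the second is exact:

```
N[FileByteCount[asciFile]/Length[list]] (* average bytes per number, newline included; roughly 19.3 here *)
FileByteCount[binFile]/Length[list]
(* 4 *)
```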
When we look at the timings for reading each file back in, we see that even just loading the raw bytes of the ASCII file, before any conversion into numbers has happened, already takes several times as long as reading the binary file completely.
BinaryReadList[asciFile, "Byte"]; // RepeatedTiming
(* {0.12, Null} *)
BinaryReadList[binFile, "Real32"]; // RepeatedTiming
(* {0.018, Null} *)
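Note that the first timing only covers loading the raw bytes; actually parsing the text into numbers, which is what a plain ReadList has to do, is slower still. You can measure the full conversion cost on your machine with

```
ReadList[asciFile, Real]; // RepeatedTiming
```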
Therefore, the only acceptable solution for reading your data serially is to use a binary format for storing it in the first place. This is supported by the fact that even C code consisting of nothing more than a loop that parses your file is not faster.
If your numbers were of fixed length, it might be worth a shot to read the data in as bytes and convert them into real numbers in parallel, but this is not possible with your current layout.
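To illustrate what such a fixed-width layout would make possible, here is a sketch, not applicable to your file as it stands: it assumes every number is padded to exactly 10 characters plus a newline (losing precision beyond 8 decimals), so the text can be split at known offsets and converted in parallel. The file name fixedFile is hypothetical, and Internal`StringToDouble is an undocumented but commonly used fast string-to-number converter; ToExpression would work as well.

```
(* hypothetical fixed-width file: every number padded to 10 characters *)
fixedFile = "~/tmp/fixed.txt";
Export[fixedFile,
  ToString@NumberForm[#, {10, 8}, NumberPadding -> {"", "0"}] & /@ list,
  "Lines"];

width = 10; (* assumed fixed width of every number, excluding the newline *)
raw = StringDelete[FromCharacterCode[BinaryReadList[fixedFile, "Byte"]], "\n"];
numbers = ParallelMap[Internal`StringToDouble, StringPartition[raw, width]];
```

Because every number starts at a known offset, no sequential scanning for separators is needed, which is what makes the parallel conversion possible in the first place.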
This is not a complete code answer, but you should get the idea.
After running some tests, it seems that reading the data in parallel speeds up the process. Unfortunately, there are a few hurdles to overcome when splitting up the stream.
Here is what I came up with so far:
filename = "D:\\temp\\list.txt"; (*replace with your file*)
LaunchKernels[];
kernels = Kernels[];
stream = OpenRead[filename];
endPosition = SetStreamPosition[stream, Infinity]; (*obtain the last position in the stream*)
Close[stream];
outList = Flatten@ParallelTable[
    list = {};
    stream = OpenRead[filename];
    SetStreamPosition[
     stream, (i - 1)*Round[endPosition/Length[kernels]]];
    If[i > 1, Skip[stream, Real]]; (*drop the partial number at the segment start; do not skip in the first segment*)
    While[True, (*this is the part that is not working properly at the end of each stream segment*)
     listTemp = ReadList[stream, Real, 100];
     If[listTemp =!= {}, AppendTo[list, listTemp]];
     If[listTemp === {} ||
       StreamPosition[stream] >= i*Round[endPosition/Length[kernels]],
      Break[]];
     ];
    Close[stream];
    list
    , {i, $KernelCount}
    ]; // AbsoluteTiming
The code determines the length of the stream in characters (note that StreamPosition
is a position in characters, not in lines or numbers). This is also where it gets a little tricky: the code above does not yet read every single number of your file correctly at the segment boundaries. You might come up with a proper solution; the code above should give you the idea.
Overall, this code cut the reading time down to approximately 1/$KernelCount
of the time needed by ReadList
with a single kernel in my experiment.
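One way to make the segment boundaries exact is sketched below, under the convention that a number straddling a boundary belongs to the segment in which it starts: every kernel skips the partial line at its start (the previous kernel has already read it) and keeps reading as long as the next number still starts inside its segment. The helper readChunk and this boundary convention are my own additions, not part of the code above, and a boundary landing exactly on a line start would still need extra care.

```
readChunk[file_, start_, end_] := Module[{st, out = {}, x},
  st = OpenRead[file];
  SetStreamPosition[st, start];
  (* drop the partial line at the segment start; the previous segment reads it *)
  If[start > 0, Skip[st, Record]];
  (* keep reading while the next number still starts inside this segment *)
  While[StreamPosition[st] < end && (x = Read[st, Real]) =!= EndOfFile,
   AppendTo[out, x]];
  Close[st];
  out];

n = $KernelCount;
bounds = Round[FileByteCount[filename] Range[0, n]/n];
outList = Flatten@ParallelTable[
    readChunk[filename, bounds[[i]], bounds[[i + 1]]], {i, n}];
```

Since each segment hands its trailing partial line to itself and drops its leading one, no number is read twice and, apart from the exact-boundary corner case, none is lost.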