Reading a file as a ByteArray?
Import[..., "String"] and Export[..., "String"] are meant precisely for this and will not cause problems. Import is guaranteed to give you "character codes" in the range 0..255, and the string can represent the file contents exactly.
This differs significantly from Import[..., "Text"], which handles character encodings, line endings, etc. and is meant for text, not for arbitrary binary data.
I have not used ByteArray and I am not sure about its purpose, but I got the impression that it is meant for the cryptography functionality, as I couldn't find any other high-level functions that work with it. It does work with basic list manipulation functions, though.
I know I can later apply ByteArray to the result of Import, but I would prefer not to use so much memory in the intermediate step.
How about reading in chunks, and packing each chunk into a ByteArray, to avoid high memory usage?
chunkSize = 300*1024; (* 300 kB due to the size of my test file *)
stream = OpenRead["file.pdf", BinaryFormat -> True];
ba = Join @@ First@Last@Reap@While[True,
    res = BinaryReadList[stream, "Byte", chunkSize];
    If[res === {}, Break[]]; (* there was no more data to read *)
    Sow[ByteArray[res]] (* collect this chunk as a ByteArray *)
  ]
Close[stream]
You can use ReadByteArray, a function new in M11.3:
path = ExampleData[{"TestImage", "Mandrill"}, "FilePath"];
MaxMemoryUsed[ba = ReadByteArray[path]] //AbsoluteTiming
Head[ba]
{0.000439, 723768}
ByteArray
It seems that StringToByteArray, new in version 11.2, lets you reduce the memory requirements:
path = ExampleData[{"TestImage", "Mandrill"}, "FilePath"];
byteList = BinaryReadList[path];
string = Import[path, "String"];
byteList // ByteCount
string // ByteCount
byteArray = StringToByteArray[string]; // MaxMemoryUsed
byteArray
5024024 945208 945504
As one can see from the above, a String requires about 5 times less memory than the packed array of integers returned by BinaryReadList, while StringToByteArray (without a second argument) takes only a tiny amount of additional memory.
A more memory-efficient representation of the data can be obtained by specifying the "ISO8859-1" encoding, at the cost of higher intermediate memory requirements:
byteArray2 = StringToByteArray[string, "ISO8859-1"]; // MaxMemoryUsed
byteArray2
1573240
Hence it would be useful to have an option to import a file as a ByteArray directly.
String does store multibyte variable length characters (utf-8 style), right? It's not just a wchar_t/short/int array, is it?
Citing Itai Seggev:
We internally encode strings in a variant of UTF-8. Now, of course, any byte can be faithfully converted to/from ISO8859-1, but that encoding only equals UTF-8 for the lower 7 bits. For other values, you need to use multiple bytes per character. So using a string to store byte data is both less space efficient and time efficient (since you need to ensure to correct conversion between the two encodings.)
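To make the encoding point concrete, here is a small sketch in Python (not Wolfram Language, and not a description of the kernel's internals). It uses standard UTF-8, whereas the quote says Mathematica uses a variant of UTF-8, so the exact storage may differ. It only illustrates why byte values above 127 cost more than one byte each once they are held as UTF-8 text.

```python
# All 256 possible byte values, as they might appear in a binary file.
data = bytes(range(256))

# ISO8859-1 maps each byte to the code point with the same number,
# so decoding and re-encoding is lossless.
text = data.decode("iso8859-1")
roundtrip = text.encode("iso8859-1")

# Standard UTF-8 stores code points 0..127 in one byte each
# and code points 128..255 in two bytes each.
utf8 = text.encode("utf-8")
utf8_lengths = [len(chr(b).encode("utf-8")) for b in range(256)]

print(roundtrip == data)      # True: ISO8859-1 round-trips every byte
print(len(data), len(utf8))   # 256 384
```

So a file made mostly of high bytes would need close to twice the space as a UTF-8 string, plus a conversion pass whenever the raw bytes are wanted back.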