What's the purpose of the new function BinarySerialize?
Disclaimer: This answer is written from a user's point of view. For useful insider information on this topic, see this discussion with Mathematica developers on the Community Forums.
Introduction
Binary serialization is the rewriting of an expression as an array of bytes (a list of integers from Range[0, 255]).
The binary representation of an expression takes less space than a textual one and can also be exported and imported faster.
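As a quick check (the expression below is arbitrary), we can look at the actual bytes that BinarySerialize produces:
bytes = Normal[BinarySerialize[{"abc", Pi, Range[5]}]];
{Min[bytes], Max[bytes]}
(* both values lie in the range 0..255 *)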
How do the Compress and BinarySerialize functions work?
Compress (with default options) always performs three steps:
- It performs binary serialization.
- It deflates the result using zlib.
- It transforms the deflated result into a Base64-encoded text string (see the quick check below).
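The textual nature of the final result is easy to verify; the "1:" prefix shown in the comment is what I see in recent versions and should be treated as an assumption:
c = Compress[Range[10]];
StringQ[c]        (* True: Compress returns a plain text string *)
StringTake[c, 2]  (* the short version prefix, "1:" in the versions I tried *)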
BinarySerialize performs only binary serialization and sometimes deflates the result using zlib. With default options it decides by itself whether to deflate. With the option PerformanceGoal -> "Speed" it avoids deflation; with PerformanceGoal -> "Size" it will likely deflate.
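A simple way to see the effect of this option is to compare the lengths of the two ByteArrays (the exact counts vary with version and input):
fast = BinarySerialize[Range[1000], PerformanceGoal -> "Speed"];   (* not deflated *)
small = BinarySerialize[Range[1000], PerformanceGoal -> "Size"];   (* likely deflated *)
{Length[fast], Length[small]}   (* the "Size" result is usually the shorter one *)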
BinarySerialize returns a ByteArray object. A ByteArray is essentially a packed array of 8-bit integers. However, the FullForm of a ByteArray is displayed as a Base64-encoded text string. This visualization can be somewhat misleading, because internally ByteArrays are stored and operated on in binary form, not as text strings.
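A small round trip makes the distinction clear: Normal exposes the underlying bytes, and BinaryDeserialize restores the original expression:
ba = BinarySerialize[{1, 2, 3}];
Head[ba]               (* ByteArray *)
Normal[ba] // Short    (* the underlying list of byte values, not a text string *)
BinaryDeserialize[ba]  (* {1, 2, 3} *)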
Binary serialization algorithms of Compress and BinarySerialize
The original serialization algorithm of Compress is described in this answer. That algorithm is not very optimized for size and produces larger-than-necessary output for many typical expressions. For example, it has no support for packed arrays of integers and rewrites such arrays as nested lists, which take up a lot of bytes.
BinarySerialize uses a more size-optimized binary serialization algorithm than Compress (with default options) does. This algorithm supports packed arrays of integers, has optimizations for integers of different sizes (8, 16, 32 bit), stores big integers in binary form (not as text strings), and has other optimizations.
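One way to glimpse the small-integer optimization is to compare the average number of serialized bytes per element for values that fit into 8 bits and values that need 64 bits (a rough sketch; exact numbers depend on the version):
small = RandomInteger[{0, 127}, 1000];       (* values that fit into 8 bits *)
large = RandomInteger[{2^40, 2^41}, 1000];   (* values that need 64 bits *)
N[Length[BinarySerialize[#, PerformanceGoal -> "Speed"]]/1000] & /@ {small, large}
(* average bytes per element; the first figure should be much smaller if 8-bit packing is applied *)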
Applications of BinarySerialize
Using BinarySerialize we can write our own Compress-like functions with better compression. For example, we can write a myCompress function that performs the same three steps as the original Compress, but uses BinarySerialize for the serialization step:
myCompress[expr_] := Module[{compressedBinaryData},
  (* steps 1 and 2: serialize and (likely) deflate in one call *)
  compressedBinaryData = BinarySerialize[expr, PerformanceGoal -> "Size"];
  (* step 3: encode the ByteArray as a Base64 text string *)
  Developer`EncodeBase64[compressedBinaryData]
];

myUncompress[string_] := Module[{binaryData},
  binaryData = Developer`DecodeBase64ToByteArray[string];
  BinaryDeserialize[binaryData]
];
Even for a simple integer list we can see a size reduction.
Compress[Range[100]] // StringLength
(* 290 *)
myCompress[Range[100]] // StringLength
(* 244 *)
myUncompress[myCompress[Range[100]]] === Range[100]
(* True *)
If we take an expression with a large number of small integers, we get a much more noticeable improvement:
bitmap = Rasterize[Plot[x, {x, 0, 1}]];
StringLength[Compress[bitmap]]
(*31246*)
StringLength[myCompress[bitmap]]
(*17820*)
myUncompress[myCompress[bitmap]] === bitmap
(* True *)
Conclusion
The example above shows that the result of a simple user-defined function myCompress based on BinarySerialize can be almost twice as compact as the result of Compress.
Outlook
To decrease the output size even further, one can use a compression algorithm with higher compression settings in the second step, or use Ascii85 encoding instead of Base64 in the third step.
Appendix 1: Undocumented options of Compress
I have noticed that in Version 11.1 Compress has more undocumented options than in previous versions. These options allow one to:
- Disable both compression and Base64 encoding and return the binary serialized result as a string with unprintable characters:
Compress[Range[100], Method -> {"Version" -> 4}]
- Change the binary serialization algorithm to a more efficient one, though not exactly the one BinarySerialize uses:
Compress[Range[100], Method -> {"Version" -> 6}] // StringLength
(* 254 *)
There is also a "ByteArray" option shown in the usage message ??Compress, but it does not work in Version 11.1.
Note that this behavior is undocumented and may change in future versions.
Appendix 2: Compression option of BinarySerialize
Just for fun, one can manually compress the result of BinarySerialize[..., PerformanceGoal -> "Speed"] to get the same output as BinarySerialize[..., PerformanceGoal -> "Size"] produces. This can be done with the following code:
myBinarySerializeSize[expr_] := Module[{binaryData, dataBytes, compressedBytes},
  binaryData = Normal[BinarySerialize[expr, PerformanceGoal -> "Speed"]];
  dataBytes = Drop[binaryData, 2];                    (* remove the magic "7:" header *)
  compressedBytes = Developer`RawCompress[dataBytes]; (* deflate with zlib *)
  ByteArray[Join[ToCharacterCode["7C:"], compressedBytes]] (* prepend the "7C:" header used for compressed data *)
]
We can check that it gives the same result as the PerformanceGoal -> "Size" option:
data = Range[100];
myBinarySerializeSize[data] === BinarySerialize[data, PerformanceGoal -> "Size"]
Appendix 3: zlib compression functions
A description of the undocumented zlib compression/decompression functions Developer`RawCompress and Developer`RawUncompress can be found in this answer.
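A minimal round trip, assuming the byte-list-in/byte-list-out signatures used in Appendix 2:
bytes = Normal[BinarySerialize[Range[100], PerformanceGoal -> "Speed"]];
compressed = Developer`RawCompress[bytes];        (* deflate a list of byte values *)
Developer`RawUncompress[compressed] === bytes     (* expected to give True if the round trip is loss-free *)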
Appendix 4: Base64 encoding functions
The usage of the Base64 encoding/decoding functions from the Developer` context can be illustrated with the following code:
binaryData = Range[0, 255];
Normal[
Developer`DecodeBase64ToByteArray[
Developer`EncodeBase64[binaryData]
]
] == binaryData
(* True *)
Measure the difference in bytes with ByteCount.
BinarySerialize[Range[100]] // ByteCount
901
Range[100] // ByteCount
936
BinarySerialize will occupy fewer bytes. Not a big difference in this trivial example, but the difference generally widens as the object to be serialised gets larger. Also, you must Compress both for an apples-to-apples comparison.
Compress@BinarySerialize[Range[100]] // ByteCount
336
Compress@Range[100] // ByteCount
368
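If you want to see how the gap behaves as the input grows, a quick machine-dependent check along the same lines is:
Table[ByteCount /@ {Compress@BinarySerialize[Range[10^k]], Compress@Range[10^k]}, {k, 2, 5}]
(* pairs of byte counts for increasing input sizes; results vary with version and machine *)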
Hope this helps.
Here's an interesting complement to Edmund's answer. It's not always the case that the ByteArray form will be more compact. DocFind is just a table of reflinks I use for doc searching.
In[155]:= bs = BinarySerialize@DocFind[]; // RepeatedTiming
Out[155]= {1.8, Null}
In[156]:= cp = Compress@DocFind[]; // RepeatedTiming
Out[156]= {1.74, Null}
In[157]:= bs // ByteCount
Out[157]= 3027593
In[158]:= cp // ByteCount
Out[158]= 233104
On the other hand, it's minimally faster to BinaryDeserialize:
In[159]:= Uncompress@cp; // RepeatedTiming
Out[159]= {0.34, Null}
In[160]:= BinaryDeserialize@bs; // RepeatedTiming
Out[160]= {0.30, Null}
If we use a much larger input, we get a much more interesting result, though:
In[145]:= bs2 = BinarySerialize@Range[10000000]; // AbsoluteTiming
Out[145]= {0.16928, Null}
In[153]:= cp2 = Compress@Range[10000000]; // AbsoluteTiming
Out[153]= {4.92081, Null}
My intuition would suggest that BinarySerialize should almost always be as fast as or faster than Compress, but I'd be interested to be proven wrong.
Here the serialized version is much more compact:
In[161]:= bs2 // ByteCount
Out[161]= 80000104
In[162]:= bs3 // ByteCount
Out[162]= 31276128
And it's so much faster to use BinaryDeserialize here than Uncompress:
In[163]:= BinaryDeserialize@bs2; // RepeatedTiming
Out[163]= {0.13, Null}
In[164]:= Uncompress@cp2; // AbsoluteTiming
Out[164]= {29.8372, Null}
Again, because it's a byte format, my intuition would suggest Binary* will be faster and often more compact than a compressed string.
Update: yode confirmed that it is platform independent.
On the other hand, since it's a byte format I would also believe it's platform dependent -- i.e., I can't take my serialized ByteArray and mail it to you and expect it to work if we have different operating systems/architectures, the same way I can't expect that of .mx files.
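For completeness, here is a minimal sketch of shipping the serialized bytes through a file (the file name is just an example); it does not prove cross-platform portability, but it is the mechanism one would use:
file = FileNameJoin[{$TemporaryDirectory, "expr.wxf"}];   (* example path *)
BinaryWrite[file, Normal[BinarySerialize[Range[10]]]];    (* write the raw bytes *)
Close[file];
BinaryDeserialize[ByteArray[BinaryReadList[file, "Byte"]]] === Range[10]
(* True on the same machine; yode's test above covers the cross-platform case *)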