What algorithm do the Compress and Uncompress functions use?

It is using the zlib format followed by Base64 coding, and then preceding the resulting string with "1:". So to use it externally, you can strip the "1:", do Base64 decoding, and feed the result of that to a zlib decoder.

However what you get out may not be immediately useful. I compressed the result of D[x^x, {x,9}], like one of the examples in the Compress[] documentation, and then decompressed (successfully) with zlib. I got what appears to be some sort of internal encoding. E.g. "!boRf" 0xa0, 0, 0, 0, "s", 0x04, 0, 0, 0, "Plusf", 0x03, ... (where the numbers are unprintable bytes).

If you want something interoperable, then use "GZIP" or "BZIP2" in ImportString and ExportString. For example, using a 100,000,000 byte excerpt of English from Wikipedia:

ExportString[enwik8,"GZIP"]//StringLength

36548933

ExportString[enwik8,"BZIP2"]//StringLength

29008736

Then you will also get to control the encoding of the data into a string to be compressed. And you can decide whether or not to encode the compressed data into a printable form, or leave it as binary.

After a bit of poking around, it looks like the binary format is pretty simple to parse. Mark Adler's answer is correct - the strings Compress[] returns are just zlib-compressed data. If you have Python installed, this function should take a compressed string and return the actual serialized bytes:

pyDecompress[c_] := StringDrop[StringDrop[StringTrim[RunProcess[{
  "python", "-c", 
  "import sys,zlib,base64; \
   print(zlib.decompress(base64.b64decode(sys.stdin.read().strip()[2:])))"
  }, "StandardOutput", c]], 2], -1]

Binary data starts with the header !boR (21 62 6f 52), followed by the serialized objects. There are apparently only eight types of objects that get serialized (unless I'm missing something):

Machine-precision integers up to 32 bits: i followed by a reversed (that is, little-endian encoded) 32-bit value. The integer 192635 (0x2f07b), for instance, gets encoded as 7b f0 02 00.
Strings: S followed by a reversed 32-bit value for size and the actual ASCII string. 8-bit non-ASCII characters are encoded as \\ followed by a three-digit number. It doesn't appear to be a Unicode offset (e.g. U+00BF encodes as 277, U+00C0 encodes as 300...). Everything else is encoded as UTF-16 in hexadecimal, preceded by \\: (e.g. U+057B becomes \\:057B and the astral plane character U+1F4A3 becomes \\:D83D\\:DCA3).
Symbols: encoded just like strings, but with the lowercase prefix s instead of S.
Arbitrary-precision integers: the actual base-10 digits encoded as a string, with the prefix I.
Machine-precision real numbers: r followed by a reversed IEEE binary64 encoded floating point number. 1.0/3.0, for instance, encodes to 55 55 55 55 55 55 d5 3f.
Arbitrary-precision real numbers: R followed by another string-style encoding: a reversed 32-bit length and the number in an expanded format. The number 13530274.2118781153, for instance, becomes the string 1.35302742118781153`17.131306598334415*^7.
Expressions: f followed by a reversed 32-bit value for the number of parts the expression has, the head of the expression encoded as a symbol, and the parts themselves. A[1,2], for instance, would be encoded as f<02 00 00 00>[A][3][12], with [A] being s<01 00 00 00>A, [3] being i<03 00 00 00> and [12] being i<0c 00 00 00>.
Real matrices: a special encoding is used for large (>249 values) n-dimensional matrices (that is, lists or nested lists) of machine-precision real numbers. Starts with e, followed by a reversed 32-bit value for the n number, n more values for each dimension and then the binary64 encoded real numbers themselves, without any spacing or prefixes. There doesn't seem to be an equivalent for integers or arbitrary-precision reals.

Update: here is my attempt at a JavaScript parser for this format. Requires a fairly new browser.

What algorithm do the Compress and Uncompress functions use?

Tags:

Compression

Implementation Details

Related

Recent Posts