Is it possible to parallelize bz2's decompression?
A pbzip2 stream is nothing more than the concatenation of multiple bzip2 streams.
An example using the shell:
bzip2 < /usr/share/dict/words > words_x_1.bz2
cat words_x_1.bz2{,,,,,,,,,} > words_x_10.bz2
time bzip2 -d < words_x_10.bz2 > /dev/null
time pbzip2 -d < words_x_10.bz2 > /dev/null
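For what it's worth, Python's own bz2 module reads such multi-stream files fine (support for multi-stream input was added in Python 3.3), so a file built this way stays readable from Python. A quick check, reusing the words_x_10.bz2 file from above:

import bz2

# bz2.decompress() handles concatenated bzip2 streams,
# so all ten copies of the dictionary come back out.
with open("words_x_10.bz2", "rb") as f:
    data = bz2.decompress(f.read())
print(data.count(b"\n"))   # 10x the line count of /usr/share/dict/words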
I've never used Python's bz2 module, but it should be easy to close and reopen a stream in append mode ('a') every so-many bytes to get the same result. Note that if BZ2File is constructed from an existing file-like object, closing the BZ2File will not close the underlying stream (which is what you want here).
I haven't measured what chunk size is optimal, but I would guess somewhere in the range of 1-20 megabytes of uncompressed data per chunk; it definitely needs to be larger than the bzip2 block size (900k), though.
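Here is a minimal sketch of that idea, using the file-object behaviour just described; it assumes Python 3's bz2 module, and the function name, file names, and 10 MB chunk size are just placeholders:

import bz2

def compress_chunked(src_path, dst_path, chunk_size=10 * 1024 * 1024):
    # Write dst_path as a series of independent bzip2 streams, one per
    # chunk_size bytes of input: the same layout pbzip2 produces.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(chunk_size)
            if not block:
                break
            # Closing the BZ2File finishes one complete stream; because it
            # wraps an existing file object, dst itself stays open.
            with bz2.BZ2File(dst, "wb") as out:
                out.write(block)

compress_chunked("/usr/share/dict/words", "words_chunked.bz2")

The resulting file should decompress with pbzip2 -d (in parallel) just as well as with plain bzip2 -d.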
Note also that if you record the compressed and uncompressed offsets of each chunk, you can do fairly efficient random access. This is how the dictzip program works, though that is based on gzip.
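A hedged sketch of that kind of index, layered on the chunked writer above; compress_indexed, read_at, and the assumption that a read never crosses a stream boundary are all just for illustration:

import bz2

def compress_indexed(src_path, dst_path, chunk_size=10 * 1024 * 1024):
    # Record (uncompressed_offset, compressed_offset) for each stream so a
    # reader can later jump straight to the stream holding a given position.
    index = []
    uncompressed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(chunk_size)
            if not block:
                break
            index.append((uncompressed, dst.tell()))
            with bz2.BZ2File(dst, "wb") as out:
                out.write(block)
            uncompressed += len(block)
    return index

def read_at(path, index, offset, size):
    # Decompress only the stream containing `offset` (a BZ2Decompressor
    # stops at the end of its stream), then slice out the requested bytes.
    u_off, c_off = max(p for p in index if p[0] <= offset)
    with open(path, "rb") as f:
        f.seek(c_off)
        data = bz2.BZ2Decompressor().decompress(f.read())
    return data[offset - u_off : offset - u_off + size]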