How to protect myself from a gzip or bzip2 bomb?
You could use the resource module to limit resources available to your process and its children.

If you need to decompress in memory then you could set resource.RLIMIT_AS (or RLIMIT_DATA, RLIMIT_STACK), e.g., using a context manager to automatically restore it to a previous value:
import contextlib
import resource

@contextlib.contextmanager
def limit(limit, type=resource.RLIMIT_AS):
    soft_limit, hard_limit = resource.getrlimit(type)
    resource.setrlimit(type, (limit, hard_limit))  # set soft limit
    try:
        yield
    finally:
        resource.setrlimit(type, (soft_limit, hard_limit))  # restore

with limit(1 << 30):  # 1GB
    # do the thing that might try to consume all memory
    ...
If the limit is reached, MemoryError is raised.
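For example, here is a minimal sketch of using the limit above around an untrusted decompression; the 100 MiB cap and the compressed_blob name are placeholders of mine, not part of the recipe:

import zlib

with limit(100 << 20):  # 100 MiB address-space cap; arbitrary example value
    try:
        data = zlib.decompress(compressed_blob)  # compressed_blob: untrusted bytes (placeholder)
    except MemoryError:
        data = None  # treat as a probable decompression bomb and reject the input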
This will determine the uncompressed size of the gzip stream, while using limited memory:
#!/usr/bin/python3
import sys
import zlib

f = open(sys.argv[1], "rb")
z = zlib.decompressobj(15 + 16)     # 15 window bits + 16 to expect a gzip header
total = 0
while True:
    buf = z.unconsumed_tail         # input left over from the last call
    if buf == b"":
        buf = f.read(1024)          # read more compressed data
        if buf == b"":
            break
    got = z.decompress(buf, 4096)   # decompress at most 4096 bytes at a time
    if got == b"":
        break
    total += len(got)
print(total)
if z.unused_data != b"" or f.read(1024) != b"":
    print("warning: more input after end of gzip stream")
It will return a slight overestimate of the space required for all of the files in the tar file when extracted. The length includes those files, as well as the tar directory information.
The gzip.py code does not control the amount of data decompressed, except by virtue of the size of the input data; it reads 1024 compressed bytes at a time. So you can use gzip.py if you're OK with up to about 1056768 bytes of memory usage for the uncompressed data (1032 * 1024, where 1032:1 is the maximum compression ratio of deflate). The solution here uses zlib.decompress with the second argument, which limits the amount of uncompressed data; gzip.py does not.
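If a bound of that order per read is acceptable, here is a minimal sketch of capping the total output while streaming through the gzip module; the cap, chunk size, and function name are my own, not part of gzip.py:

import gzip

def bounded_gunzip(path, out_path, chunk=1024, max_output=1 << 30):
    """Stream-decompress a gzip file, aborting once max_output bytes come out.

    chunk and max_output are arbitrary example values.
    """
    total = 0
    with gzip.open(path, "rb") as src, open(out_path, "wb") as dst:
        while True:
            got = src.read(chunk)  # returns at most chunk uncompressed bytes
            if not got:
                break
            total += len(got)
            if total > max_output:
                raise ValueError("gzip bomb? output exceeded %d bytes" % max_output)
            dst.write(got)
    return total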
This will accurately determine the total size of the extracted tar entries by decoding the tar format:
#!/usr/bin/python3
import sys
import zlib

def decompn(f, z, n):
    """Return n uncompressed bytes, or fewer if at the end of the compressed
    stream. This only decompresses as much as necessary, in order to
    avoid excessive memory usage for highly compressed input.
    """
    blk = b""
    while len(blk) < n:
        buf = z.unconsumed_tail
        if buf == b"":
            buf = f.read(1024)
        got = z.decompress(buf, n - len(blk))
        blk += got
        if got == b"":
            break
    return blk

f = open(sys.argv[1], "rb")
z = zlib.decompressobj(15 + 16)     # 15 window bits + 16 to expect a gzip header
total = 0
left = 0
while True:
    blk = decompn(f, z, 512)        # tar works in 512-byte blocks
    if len(blk) < 512:
        break
    if left == 0:                   # expecting a header block
        if blk == b"\0" * 512:      # all-zero block marks end of archive
            continue
        if blk[156] in b"123456":   # links, devices, dirs, FIFOs: no data blocks
            continue
        if blk[124] == 0x80:        # base-256 (binary) size field
            size = 0
            for i in range(125, 136):
                size <<= 8
                size += blk[i]
        else:                       # octal size field
            size = int(blk[124:136].split()[0].split(b"\0")[0], 8)
        if blk[156] not in b"xgXLK":    # don't count extended-header data
            total += size
        left = (size + 511) // 512  # number of data blocks that follow
    else:                           # skipping over an entry's data blocks
        left -= 1
print(total)
if blk != b"":
    print("warning: partial final block")
if left != 0:
    print("warning: tar file ended in the middle of an entry")
if z.unused_data != b"" or f.read(1024) != b"":
    print("warning: more input after end of gzip stream")
You could use a variant of this to scan the tar file for bombs. This has the advantage of finding a large size in the header information before you even have to decompress that data.
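A minimal sketch of such a check; the limits and the helper name are mine, not from the script above:

def check_entry(size, total, max_entry=1 << 30, max_total=8 << 30):
    """Reject a tar entry whose header-declared size looks like a bomb.

    Call this right after size is parsed from the header in the loop
    above, before any of the entry's data blocks are decompressed.
    The limits are arbitrary example values.
    """
    if size > max_entry or total + size > max_total:
        raise ValueError("tar bomb? declared size %d exceeds limit" % size)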
As for .tar.bz2 archives, the Python bz2 library (at least as of 3.3) is unavoidably unsafe for bz2 bombs consuming too much memory. The bz2.decompress function does not offer a second argument like zlib.decompress does. This is made even worse by the fact that the bz2 format has a much, much higher maximum compression ratio than zlib due to run-length coding. bzip2 compresses 1 GB of zeros to 722 bytes. So you cannot meter the output of bz2.decompress by metering the input, as can be done with zlib.decompress even without the second argument. The lack of a limit on the decompressed output size is a fundamental flaw in the Python interface.
I looked at _bz2module.c in 3.3 to see if there is an undocumented way to use it to avoid this problem. There is no way around it. The decompress function in there just keeps growing the result buffer until it can decompress all of the provided input. _bz2module.c needs to be fixed.
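For what it's worth, CPython 3.5 later added a max_length argument to bz2.BZ2Decompressor.decompress, which makes a metered loop possible. A minimal sketch, assuming Python 3.5 or later; the 1024-byte reads and 4096-byte output cap mirror the zlib loop above:

import sys
import bz2

f = open(sys.argv[1], "rb")
d = bz2.BZ2Decompressor()
total = 0
while not d.eof:
    # Feed new compressed input only when asked; otherwise pass b"" to
    # keep draining the decompressor's internal buffer.
    buf = f.read(1024) if d.needs_input else b""
    if d.needs_input and buf == b"":
        break  # truncated stream
    total += len(d.decompress(buf, max_length=4096))  # bounded output per call
print(total)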