How to open a Unicode text file inside a zip?

Edit: For Python 3, using io.TextIOWrapper as this answer describes is the best choice. The answer below could still be helpful for 2.x. I don't think anything below is actually incorrect even for 3.x, but io.TextIOWrapper is still better.

If the file is utf-8, this will work:

# the rest of the code as above, then:
with zfile.open(name, 'rU') as readFile:
    line = readFile.readline().decode('utf8')
    # etc

If you're going to be iterating over the file you can use codecs.iterdecode, but that won't work with readline().

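import codecs

with zfile.open(name, 'rU') as readFile:
    # iterdecode decodes incrementally as the file is iterated
    for line in codecs.iterdecode(readFile, 'utf8'):
        print line  # Python 2 print statement
        # etc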

Note that neither approach is necessarily safe for multibyte encodings. For example, little-endian UTF-16 represents the newline character with the bytes b'\x0A\x00'. A non-unicode aware tool looking for newlines will split that incorrectly, leaving the null bytes on the following line. In such a case you'd have to use something that doesn't try to split the input by newlines, such as ZipFile.read, and then decode the whole byte string at once. This is not a concern for utf-8.
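The whole-file approach mentioned above could look roughly like the sketch below, written for Python 3. The helper name read_zip_text, the member name sample.txt, and the in-memory archive are assumptions made only so the example is self-contained; they are not from the original question.

```python
import io
import zipfile


def read_zip_text(zip_source, member, encoding):
    """Read one archive member fully, then decode it in a single step."""
    with zipfile.ZipFile(zip_source) as zfile:
        raw = zfile.read(member)  # the member's complete contents as bytes
    return raw.decode(encoding)


# Build a small in-memory archive (hypothetical data) holding UTF-16-LE text.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zfile:
    zfile.writestr("sample.txt", "first\nsecond\n".encode("utf-16-le"))

text = read_zip_text(buffer, "sample.txt", "utf-16-le")

# Split only after decoding, so newlines are recognised as characters
# rather than as individual bytes.
lines = text.splitlines()
```

Because the split happens on the decoded string, no stray null bytes end up at the start of a line. The trade-off is that the whole member is held in memory at once.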

To convert a byte stream into a Unicode stream, you could use io.TextIOWrapper():

import io
import zipfile

encoding = 'utf-8'
with zipfile.ZipFile("5.csv.zip") as zfile:
    for name in zfile.namelist():
        with zfile.open(name) as readfile:
            # wrap the binary member so iteration yields decoded str lines
            for line in io.TextIOWrapper(readfile, encoding):
                print(repr(line))

Note: TextIOWrapper() uses universal newline mode by default. The rU mode of zfile.open() has been deprecated since Python 3.4 and was removed in Python 3.6.

It avoids issues with multibyte encodings described in @Peter DeGlopper's answer.
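As a rough illustration of that point, the sketch below wraps a UTF-16-LE member in io.TextIOWrapper and iterates over it line by line. The archive is built in memory, and the member name sample.txt and its contents are made up for the example.

```python
import io
import zipfile

# Build a small in-memory archive (hypothetical data) holding UTF-16-LE text.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zfile:
    zfile.writestr("sample.txt", "a\tb\nc\td\n".encode("utf-16-le"))

rows = []
with zipfile.ZipFile(buffer) as zfile:
    with zfile.open("sample.txt") as raw:
        # The wrapper decodes first and only then looks for line endings,
        # so the two-byte newline is handled as a single character.
        for line in io.TextIOWrapper(raw, encoding="utf-16-le"):
            rows.append(line.rstrip("\n").split("\t"))
```

Each element of rows is a list of str fields, so the tab separator passed to split is an ordinary string rather than a byte string.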

You're seeing that error because you are mixing bytes with Unicode. If the line is a byte string, the argument to split must also be a byte string:

>>> line = b'$0.0\t1822\t1\t1\t1\n'
>>> line.split(b'\t')
[b'$0.0', b'1822', b'1', b'1', b'1\n']

To get a unicode string, use decode:

>>> line.decode('utf-8')
'$0.0\t1822\t1\t1\t1\n'
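Putting the two steps together, a minimal Python 3 sketch might decode first and then split on an ordinary string separator. The variable names fields and error are only illustrative.

```python
line = b'$0.0\t1822\t1\t1\t1\n'

# Mixing types fails: a bytes object cannot be split on a str separator.
try:
    line.split('\t')
except TypeError as exc:
    error = exc  # kept only to show that the mixed call raises

# Decode once, then work with str throughout.
fields = line.decode('utf-8').rstrip('\n').split('\t')
```

Decoding as early as possible keeps the rest of the code working on one string type.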
