How to open an unicode text file inside a zip?
edit For Python 3, using io.TextIOWrapper
as this answer describes is the best choice. The answer below could still be helpful for 2.x. I don't think anything below is actually incorrect even for 3.x, but io.TestIOWrapper
is still better.
If the file is utf-8, this will work:
# the rest of the code as above, then:
with zfile.open(name, 'rU') as readFile:
line = readFile.readline().decode('utf8')
# etc
If you're going to be iterating over the file you can use codecs.iterdecode
, but that won't work with readline()
.
with zfile.open(name, 'rU') as readFile:
for line in codecs.iterdecode(readFile, 'utf8'):
print line
# etc
Note that neither approach is necessarily safe for multibyte encodings. For example, little-endian UTF-16 represents the newline character with the bytes b'\x0A\x00'
. A non-unicode aware tool looking for newlines will split that incorrectly, leaving the null bytes on the following line. In such a case you'd have to use something that doesn't try to split the input by newlines, such as ZipFile.read
, and then decode the whole byte string at once. This is not a concern for utf-8.
To convert a byte stream into Unicode stream, you could use io.TextIOWrapper()
:
encoding = 'utf-8'
with zipfile.ZipFile("5.csv.zip") as zfile:
for name in zfile.namelist():
with zfile.open(name) as readfile:
for line in io.TextIOWrapper(readfile, encoding):
print(repr(line))
Note: TextIOWrapper()
uses universal newline mode by default. rU
mode in zfile.open()
is deprecated since version 3.4.
It avoids issues with multibyte encodings described in @Peter DeGlopper's answer.
The reason why you're seeing that error is because you are trying to mix bytes with unicode. The argument to split
must also be byte-string:
>>> line = b'$0.0\t1822\t1\t1\t1\n'
>>> line.split(b'\t')
[b'$0.0', b'1822', b'1', b'1', b'1\n']
To get a unicode string, use decode:
>>> line.decode('utf-8')
'$0.0\t1822\t1\t1\t1\n'