Python: UnicodeDecodeError: 'utf8' codec can't decode byte

If the file really is UTF-8, open it with an explicit encoding so the bytes are decoded as you read them:

import codecs

# Decode the file's bytes as UTF-8 while reading; the with block closes the file.
with codecs.open(dir+location, 'r', encoding='utf-8') as f:
    txt = f.read()

From that point on, txt is a unicode object and you can use it everywhere in your code.
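To see the decoding step in isolation, here is a minimal sketch that writes a few UTF-8 bytes to a temporary file and reads them back. The file name sample.txt and its contents are made up for the example. It uses io.open, which accepts the same encoding argument and is the built-in open on Python 3.

```python
import io
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt")
with open(path, "wb") as raw:
    # UTF-8 bytes containing two accented letters (i-diaeresis, e-acute).
    raw.write(b"na\xc3\xafve caf\xc3\xa9\n")

# Decode while reading: txt is text (unicode), not raw bytes.
with io.open(path, "r", encoding="utf-8") as f:
    txt = f.read()

print(repr(txt))
```

Each accented letter takes two bytes on disk but is a single character in txt, which is why the decoded text is shorter than the file.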

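If you want to generate UTF-8 files after your processing, open a separate file for writing (f above was opened for reading only) and encode the text yourself:

with open(out_path, 'wb') as out:  # out_path is a placeholder for your output file
    out.write(txt.encode('utf-8'))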
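As an alternative sketch, you can let the file object do the encoding: open the output with an encoding and write the text directly. The file name out.txt and the sample string are placeholders, not something from the original question.

```python
import io
import os
import tempfile

txt = u"na\u00efve caf\u00e9"  # stand-in for the text you processed

# Hypothetical output location, used only for this demonstration.
out_path = os.path.join(tempfile.mkdtemp(), "out.txt")

# The file object encodes to UTF-8 on write, so no manual .encode() is needed.
with io.open(out_path, "w", encoding="utf-8") as out:
    out.write(txt)

# Reading the raw bytes back shows what actually landed on disk.
with open(out_path, "rb") as raw:
    data = raw.read()

print(repr(data))
```

Keeping text as unicode inside the program and encoding only at the file boundary avoids mixing bytes and text by accident.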

As I said on the mailing list, it is probably easiest to use the Vectorizer's charset_error option and set it to 'ignore', which skips the bytes that cannot be decoded. If the file is actually UTF-16, you can instead set charset to 'utf-16' in the Vectorizer. See the docs.
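What 'ignore' and a different charset do can be illustrated with plain Python decoding, without scikit-learn. This is only a sketch of the underlying behaviour, and the sample bytes are invented. As far as I know, newer scikit-learn releases call these options decode_error and encoding, so check the docs for your version.

```python
# Hypothetical sample: Latin-1 encoded bytes, which are not valid UTF-8.
raw = b"caf\xe9 au lait"

# Strict decoding (the default) raises the error from the question title.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", "ignore")

# 'replace' keeps a visible marker (U+FFFD) where the bad byte was.
replaced = raw.decode("utf-8", "replace")

# If the data is really UTF-16, the right fix is the right codec.
utf16_raw = u"caf\u00e9".encode("utf-16")
utf16_text = utf16_raw.decode("utf-16")

print(strict_failed, repr(ignored), repr(replaced), repr(utf16_text))
```

Note that 'ignore' loses data without telling you, so it is worth confirming the real encoding of the file before settling on it.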

Tags: Python, Encoding, UTF-8, scikit-learn
