Python: UnicodeDecodeError: 'utf8' codec can't decode byte
This will solve your issues:
import codecs
f = codecs.open(dir+location, 'r', encoding='utf-8')
txt = f.read()
From that point on, txt is a unicode string and you can use it anywhere in your code.
If you want to generate UTF-8 files after your processing do:
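with codecs.open(output_path, 'w', encoding='utf-8') as out:  # output_path is a placeholder name
    out.write(txt)  # pass the unicode string; the codecs writer encodes it to UTF-8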
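For reference, here is a minimal sketch of the whole read, process, write round trip. It uses io.open (which behaves like the built-in open of Python 3) rather than codecs.open; that is a substitution on my part, and both accept an encoding argument. The temporary paths, the sample text and the upper() step are made up for illustration only.

```python
import io
import os
import tempfile

# Hypothetical paths for illustration; substitute your own files.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input.txt")
dst_path = os.path.join(workdir, "output.txt")

# Create a small UTF-8 sample file so the sketch runs on its own.
with io.open(src_path, "wb") as f:
    f.write(u"caf\u00e9".encode("utf-8"))

# Decode on read: txt is a unicode string from here on.
with io.open(src_path, "r", encoding="utf-8") as f:
    txt = f.read()

# Stand-in for whatever processing you do on the text.
processed = txt.upper()

# Encode on write: open a separate file for writing and pass the
# unicode string; the file object does the UTF-8 encoding.
with io.open(dst_path, "w", encoding="utf-8") as f:
    f.write(processed)

# Read the raw bytes back to confirm what ended up on disk.
with io.open(dst_path, "rb") as f:
    raw = f.read()
```

Note that the output goes to a file opened for writing, not to the handle that was opened with 'r'.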
As I said on the mailing list, it is probably easiest to use the charset_error option and set it to ignore.
If the file is actually utf-16, you can also set the charset to utf-16 in the Vectorizer.
See the docs.
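To make the two options concrete, here is a small sketch of what they amount to at the byte level, using only plain Python decoding and not the Vectorizer itself. The sample byte strings are invented for illustration. If I remember correctly, newer scikit-learn releases call these options decode_error and encoding, so check the docs for the version you have.

```python
# Bytes that are not valid UTF-8: 0xE9 is "e acute" in Latin-1.
bad = b"caf\xe9 au lait"

# Strict decoding (the default) raises the error from the question.
try:
    bad.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With errors="ignore" the offending byte is silently dropped.
ignored = bad.decode("utf-8", errors="ignore")

# A UTF-16 file starts with a byte order mark that is invalid as UTF-8.
utf16_bytes = u"caf\u00e9".encode("utf-16")
try:
    utf16_bytes.decode("utf-8")
    utf16_as_utf8_failed = False
except UnicodeDecodeError:
    utf16_as_utf8_failed = True

# Decoding with the right charset recovers the text without any loss.
recovered = utf16_bytes.decode("utf-16")
```

Keep in mind that ignore throws data away, so if the file really is UTF-16 (or some other charset), setting the correct charset is the better fix.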