Python - 'ascii' codec can't decode byte

"你好".encode('utf-8')

encode converts a unicode object to a string object. But here you have invoked it on a string object (because you don't have the u). So python has to convert the string to a unicode object first. So it does the equivalent of

"你好".decode().encode('utf-8')

But the decode fails because the string isn't valid ascii. That's why you get a complaint about not being able to decode.

Always encode from unicode to bytes.
In this direction, you get to choose the encoding.

>>> u"你好".encode("utf8")
'\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print _
你好

The other way is to decode from bytes to unicode.
In this direction, you have to know what the encoding is.

>>> bytes = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> print bytes
你好
>>> bytes.decode('utf-8')
u'\u4f60\u597d'
>>> print _
你好

This point can't be stressed enough. If you want to avoid playing unicode "whack-a-mole", it's important to understand what's happening at the data level. Here it is explained another way:

A unicode object is decoded already, you never want to call decode on it.
A bytestring object is encoded already, you never want to call encode on it.

Now, on seeing .encode on a byte string, Python 2 first tries to implicitly convert it to text (a unicode object). Similarly, on seeing .decode on a unicode string, Python 2 implicitly tries to convert it to bytes (a str object).

These implicit conversions are why you can get UnicodeDecodeError when you've called encode. It's because encoding usually accepts a parameter of type unicode; when receiving a str parameter, there's an implicit decoding into an object of type unicode before re-encoding it with another encoding. This conversion chooses a default 'ascii' decoder^†, giving you the decoding error inside an encoder.

In fact, in Python 3 the methods str.decode and bytes.encode don't even exist. Their removal was a [controversial] attempt to avoid this common confusion.

^† _{...or whatever coding sys.getdefaultencoding() mentions; usually this is 'ascii'}

You can try this

import sys
reload(sys)
sys.setdefaultencoding("utf-8")

You can also try following

Add following line at top of your .py file.

# -*- coding: utf-8 -*-

Python - 'ascii' codec can't decode byte

Tags:

Python

Unicode

Python 2.7

Python 2.X

Python Unicode

Related

Recent Posts