What is a unicode string?
Update: Python 3
In Python 3, Unicode strings are the default. The type str is a collection of Unicode code points, and the type bytes is used for representing collections of 8-bit integers (often interpreted as ASCII characters).
Here is the code from the question, updated for Python 3:
>>> my_str = 'A unicode \u018e string \xf1' # no need for "u" prefix
# the escape sequence "\u" denotes a Unicode code point (in hex)
>>> my_str
'A unicode Ǝ string ñ'
# the Unicode code points U+018E and U+00F1 were displayed
# as their corresponding glyphs
>>> my_bytes = my_str.encode('utf-8') # convert to a bytes object
>>> my_bytes
b'A unicode \xc6\x8e string \xc3\xb1'
# the "b" prefix means a bytes literal
# the escape sequence "\x" denotes a byte using its hex value
# the code points U+018E and U+00F1 were encoded as 2-byte sequences
>>> my_str2 = my_bytes.decode('utf-8') # convert back to str
>>> my_str2 == my_str
True
Working with files:
>>> f = open('foo.txt', 'r') # text mode (Unicode)
>>> # the platform's default encoding (e.g. UTF-8) is used to decode the file
>>> # to set a specific encoding, use open('foo.txt', 'r', encoding="...")
>>> for line in f:
...     pass  # here line is a str object
...
>>> f.close()
>>> f = open('foo.txt', 'rb') # "b" means binary mode (bytes)
>>> for line in f:
...     pass  # here line is a bytes object
...
>>> f.close()
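To make the difference between the two modes concrete, here is a minimal sketch that writes the example string to a temporary file and reads it back both ways. The temporary file and the variable names are assumptions made for this illustration; they are not part of the original question.

```python
import os
import tempfile

text = 'A unicode \u018e string \xf1'

# Write the text to a throwaway file as UTF-8 (hypothetical demo file).
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write(text)
    path = tmp.name

try:
    # Text mode: the bytes on disk are decoded, so read() returns a str.
    with open(path, 'r', encoding='utf-8') as f:
        as_text = f.read()

    # Binary mode: nothing is decoded, so read() returns the raw bytes.
    with open(path, 'rb') as f:
        as_bytes = f.read()
finally:
    os.remove(path)

print(type(as_text).__name__, len(as_text))    # str 20
print(type(as_bytes).__name__, len(as_bytes))  # bytes 22
```

Passing encoding explicitly avoids depending on the platform default, and the with blocks close the files even if reading fails.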
Historical answer: Python 2
In Python 2, the str type was a collection of 8-bit characters (like Python 3's bytes type). The English alphabet can be represented using these 8-bit characters, but symbols such as Ω, и, ±, and ♠ cannot.
Unicode is a standard for working with a wide range of characters. Each symbol has a code point (a number), and these code points can be encoded (converted to a sequence of bytes) using a variety of encodings.
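As a small illustration of the distinction between a code point and its encoded bytes, the sketch below looks at a single character in Python 3 syntax. The choice of Ω and of the three encodings is only for demonstration.

```python
ch = 'Ω'

# A code point is just a number; ord() and chr() convert both ways.
code_point = ord(ch)
print(hex(code_point))             # 0x3a9
print(chr(code_point) == ch)       # True

# The same code point becomes different bytes under different encodings.
utf8 = ch.encode('utf-8')
utf16 = ch.encode('utf-16-le')
utf32 = ch.encode('utf-32-le')
print(utf8)    # b'\xce\xa9'
print(utf16)   # b'\xa9\x03'
print(utf32)   # b'\xa9\x03\x00\x00'
```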
UTF-8 is one such encoding. The low code points are encoded using a single byte, and higher code points are encoded as sequences of bytes.
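The variable-length behaviour can be checked directly. This is a rough sketch in Python 3 syntax; the sample characters are picked from the symbols mentioned above plus the ASCII letter A.

```python
samples = ['A', '±', 'Ω', 'и', '♠']

for ch in samples:
    encoded = ch.encode('utf-8')
    # Print the code point, the number of bytes, and the bytes themselves.
    print('U+%04X' % ord(ch), len(encoded), encoded)

byte_counts = [len(ch.encode('utf-8')) for ch in samples]
print(byte_counts)  # [1, 2, 2, 2, 3]
```

Code points below U+0080 take one byte, those up to U+07FF take two, and ♠ (U+2660) takes three.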
To allow working with Unicode characters, Python 2 has a unicode type which is a collection of Unicode code points (like Python 3's str type). The line ustring = u'A unicode \u018e string \xf1' creates a Unicode string with 20 characters.
When the Python interpreter displays the value of ustring, it escapes two of the characters (Ǝ and ñ) because they are outside the printable ASCII range.
The line s = ustring.encode('utf-8') encodes the Unicode string using UTF-8. This converts each code point to the appropriate byte or sequence of bytes. The result is a collection of bytes, which is returned as a str. The size of s is 22 bytes, because two of the characters have high code points and are each encoded as a sequence of two bytes rather than a single byte.
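The 20 versus 22 count can be reproduced in Python 3, where str plays the role of Python 2's unicode and bytes plays the role of Python 2's str. This is a sketch under that assumption; it is not the Python 2 session from the question.

```python
ustring = 'A unicode \u018e string \xf1'   # Python 3 str (Python 2: unicode)
s = ustring.encode('utf-8')                # Python 3 bytes (Python 2: str)

print(len(ustring))  # 20 code points
print(len(s))        # 22 bytes

# Only the non-ASCII characters need more than one byte each.
multi_byte = [ch for ch in ustring if len(ch.encode('utf-8')) > 1]
print(len(multi_byte))  # 2
```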
When the Python interpreter displays the value of s, it escapes four bytes that are not in the printable range (\xc6, \x8e, \xc3, and \xb1). The two pairs of bytes are not treated as single characters like before because s is of type str, not unicode.
The line t = unicode(s, 'utf-8') does the opposite of encode(). It reconstructs the original code points by looking at the bytes of s and parsing byte sequences. The result is a Unicode string.
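In Python 3 the equivalent of unicode(s, 'utf-8') is s.decode('utf-8'). The sketch below shows the round trip, and also what happens when the bytes are decoded with a codec that cannot parse them. The variable names are illustrative.

```python
s = b'A unicode \xc6\x8e string \xc3\xb1'

# Decoding with the matching codec recovers the original code points.
t = s.decode('utf-8')
print(len(t))                                  # 20
print(t == 'A unicode \u018e string \xf1')     # True

# ASCII has no meaning for bytes above 0x7f, so decoding fails.
decode_error = None
try:
    s.decode('ascii')
except UnicodeDecodeError as exc:
    decode_error = exc

# The error records the index of the first offending byte.
print(decode_error.start)  # 10
```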
The call to codecs.open() specifies utf-8 as the encoding, which tells Python to interpret the contents of the file (a collection of bytes) as a Unicode string that has been encoded using UTF-8.
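To see why the encoding argument matters, this sketch reads the same file with two different encodings. It uses io.open() rather than codecs.open(); that is a substitution made for this example (io.open() accepts an encoding argument in both Python 2.6+ and Python 3), and the temporary file is hypothetical.

```python
import io
import os
import tempfile

text = 'A unicode \u018e string \xf1'

fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Matching encoding: the bytes are parsed back into the original text.
    with io.open(path, 'r', encoding='utf-8') as f:
        right = f.read()

    # Mismatched encoding: Latin-1 maps every byte to one character,
    # so the two-byte sequences come back as two characters each.
    with io.open(path, 'r', encoding='latin-1') as f:
        wrong = f.read()
finally:
    os.remove(path)

print(right == text)  # True
print(len(right))     # 20
print(len(wrong))     # 22
```

The file's bytes are identical in both reads; only the interpretation chosen by the encoding argument changes.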