Byte string vs. unicode string in Python

Here's an attempt at a simple explanation that only applies to Python 3. I hope that, coming from a layperson, it helps clear up some confusion for the completely uninitiated. If there are any technical inaccuracies, please forgive me and feel free to point them out.

Suppose you create a string using Python 3 in the usual way:

stringobject = 'ant'

stringobject would be a unicode string.

A unicode string is made up of unicode characters. In stringobject above, the unicode characters are the individual letters a, n, and t.

Each unicode character is assigned a code point, which can be expressed as a sequence of hex digits (a hex digit can take on 16 values, ranging from 0-9 and A-F). For instance, the letter 'a' is equivalent to '\u0061', and 'ant' is equivalent to '\u0061\u006E\u0074'.

So you will find that if you type in,

stringobject = '\u0061\u006E\u0074'
stringobject

You will also get the output 'ant'.
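
If you want to check a character's code point yourself, the built-in ord() and chr() functions convert between a character and its code point; the quick interactive sketch below is just an illustration:

>>> ord('a'), hex(ord('a'))
(97, '0x61')
>>> chr(0x61)
'a'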

Now, a unicode string can be converted to bytes in a process known as encoding. The reverse process, converting bytes back into a unicode string, is known as decoding.

How is this done? Since each hex digit can take on 16 different values, it can be reflected in a 4-bit binary sequence (e.g. the hex digit 0 can be expressed in binary as 0000, the hex digit 1 can be expressed as 0001 and so forth). If a unicode character has a code point consisting of four hex digits, it would need a 16-bit binary sequence to encode it.
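
As a small sketch of that arithmetic (it only illustrates the numeric relationship, not what any particular encoding actually stores), format() can render the code point of 'a' as a 16-bit binary sequence:

>>> format(ord('a'), '016b')
'0000000001100001'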

Different encoding systems specify different rules for converting unicode to bits. Most importantly, encodings differ in the number of bits they use to express each unicode character.

For instance, the ASCII encoding system uses only 7 bits per character, so it can only encode 128 different unicode characters: those with code points up to '\u007F' (unaccented Latin letters, digits, and basic punctuation). The UTF-8 encoding system uses 8 to 32 bits (1 to 4 bytes) per character, so it can encode every unicode character; code points run up to '\U0010FFFF', i.e. at most six hex digits.
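
To make the byte counts concrete, here is a small sketch of how many bytes UTF-8 uses for characters with progressively larger code points (the sample characters are chosen purely for illustration):

>>> len('a'.encode('utf-8'))           # U+0061, 1 byte
1
>>> len('\u03b1'.encode('utf-8'))      # GREEK SMALL LETTER ALPHA, 2 bytes
2
>>> len('\u20ac'.encode('utf-8'))      # EURO SIGN, 3 bytes
3
>>> len('\U0001f41c'.encode('utf-8'))  # ANT, 4 bytes
4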

Running the following code:

byteobject = stringobject.encode('utf-8')
byteobject, type(byteobject)

converts the unicode string into a byte string using the utf-8 encoding system, and returns (b'ant', <class 'bytes'>).

Note that if you used 'ascii' as the encoding system, you wouldn't run into any problems, since every character in 'ant' falls within the ASCII range. But if you had a unicode string containing characters with code points above '\u007F', you would get a UnicodeEncodeError.
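
For example, encoding with 'ascii' works for 'ant' but fails as soon as a character outside the ASCII range appears (traceback abbreviated):

>>> 'ant'.encode('ascii')
b'ant'
>>> 'αnt'.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\u03b1' in position 0: ordinal not in range(128)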

Similarly,

stringobject = byteobject.decode('utf-8')
stringobject, type(stringobject)

gives you ('ant', <class 'str'>).


No, Python does not use its own encoding - it will use any encoding that it has access to and that you specify.

A character in a str represents one Unicode character. However, since Unicode defines far more than 256 characters, an encoding may need more than one byte per character to represent them all.

bytes objects give you access to the underlying bytes. str objects have an encode method that takes a string naming an encoding and returns the bytes object representing the string in that encoding. bytes objects have a decode method that takes a string naming an encoding and returns the str that results from interpreting the bytes as text in the given encoding.

For example:

>>> a = "αά".encode('utf-8')
>>> a
b'\xce\xb1\xce\xac'
>>> a.decode('utf-8')
'αά'

We can see that UTF-8 is using four bytes, \xce, \xb1, \xce, and \xac, to represent two characters.
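
You can confirm the counts directly: the str contains two characters, while its UTF-8 encoding contains four bytes:

>>> len('αά')
2
>>> len(a)
4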

Related reading:

  • Python Unicode Howto (from the official documentation)

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

  • Pragmatic Unicode by Ned Batchelder