What is the internal representation of a string in Python 3.x?

The internal representation changes in Python 3.3, which implements PEP 393. The new representation picks one or several of ASCII, Latin-1, UTF-8, UTF-16 and UTF-32 for each string, generally aiming for the most compact form that can hold all of its characters.

Implicit conversions into surrogate pairs are only done when talking to legacy APIs (those only exist on Windows, where wchar_t is two bytes); the Python string itself is preserved. See the Python 3.3 release notes for details.
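To illustrate the point about surrogate pairs, here is a small sketch. It uses an explicit encode to UTF-16 as a stand-in for what a conversion to a two-byte wchar_t API would produce (that equivalence is an assumption made for illustration); the variable names are made up for the example.

```python
import struct

# One character outside the Basic Multilingual Plane (U+1F600).
s = "\U0001F600"

# On Python 3.3+ the string holds a single code point.
length = len(s)

# Encoding to UTF-16 is where the surrogate pair appears:
# two 16-bit code units for the one code point.
utf16 = s.encode("utf-16-le")
units = struct.unpack("<2H", utf16)

print(length)                   # number of code points in the str
print([hex(u) for u in units])  # high and low surrogate code units
```

The string object is untouched by the encode; only the encoded bytes contain the pair.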


In Python 3.3 and above, the internal representation depends on the contents of the string (specifically, on its largest code point) and can be any of Latin-1, UCS-2 or UCS-4, as described in PEP 393.
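A rough way to observe this from Python code is to compare the sizes of two strings made of the same character. This is only a sketch: sys.getsizeof reports CPython's own accounting, so the numbers are an implementation detail, and bytes_per_char is a hypothetical helper name.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by differencing two string sizes.

    Subtracting the sizes cancels the fixed object header, leaving
    only the cost of the n extra characters.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

print(bytes_per_char("a"))           # ASCII character
print(bytes_per_char("\u00e9"))      # Latin-1 range (U+00E9)
print(bytes_per_char("\u20ac"))      # needs UCS-2 (U+20AC)
print(bytes_per_char("\U0001F600"))  # needs UCS-4 (U+1F600)
```

On CPython 3.3 or later this should report 1, 1, 2 and 4 bytes respectively; other implementations may store strings differently.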

For earlier versions of Python, the internal representation depends on the flags Python was built with. Python can be built with --enable-unicode=ucs2 or --enable-unicode=ucs4. Despite the name, ucs2 builds in fact use UTF-16 as their internal representation, and ucs4 builds use UCS-4 / UTF-32.
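If you need to find out at runtime which situation applies, sys.maxunicode is one way to check. The sketch below assumes only that documented attribute; describe_build is a hypothetical helper, and its parameter exists just so that both cases can be exercised.

```python
import sys

def describe_build(maxunicode=sys.maxunicode):
    """Classify the interpreter by its largest supported code point."""
    if maxunicode == 0xFFFF:
        # Narrow (ucs2) build: non-BMP characters are stored as
        # surrogate pairs, so len() counts them as two.
        return "narrow"
    # Wide (ucs4) build, or any Python 3.3+ where PEP 393 applies.
    return "wide"

print(describe_build())
```

On Python 3.3 and later sys.maxunicode is always 0x10FFFF, so this check only distinguishes builds of older versions.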