Process escape sequences in a string in Python
unicode_escape doesn't work in general
It turns out that the string_escape or unicode_escape solution does not work in general -- in particular, it doesn't work in the presence of actual Unicode.
If you can be sure that every non-ASCII character will be escaped (and remember, anything beyond the first 128 characters is non-ASCII), unicode_escape will do the right thing for you. But if there are any literal non-ASCII characters already in your string, things will go wrong.
unicode_escape is fundamentally designed to convert bytes into Unicode text. But in many places -- for example, Python source code -- the source data is already Unicode text.
The only way this can work correctly is if you encode the text into bytes first. UTF-8 is the sensible encoding for all text, so that should work, right?
The following examples are in Python 3, so that the string literals are cleaner, but the same problem exists with slightly different manifestations on both Python 2 and 3.
>>> s = 'naïve \\t test'
>>> print(s.encode('utf-8').decode('unicode_escape'))
naÃ¯ve   test
Well, that's wrong.
The new recommended way to use codecs that decode text into text is to call codecs.decode directly. Does that help?
>>> import codecs
>>> print(codecs.decode(s, 'unicode_escape'))
naÃ¯ve   test
Not at all. (Also, the above is a UnicodeError on Python 2.)
The unicode_escape codec, despite its name, turns out to assume that all non-ASCII bytes are in the Latin-1 (ISO-8859-1) encoding. So you would have to do it like this:
>>> print(s.encode('latin-1').decode('unicode_escape'))
naïve   test
But that's terrible. This limits you to the 256 Latin-1 characters, as if Unicode had never been invented at all!
>>> print('Ernő \\t Rubik'.encode('latin-1').decode('unicode_escape'))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0151' in position 3: ordinal not in range(256)
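The Latin-1 assumption can be checked directly. The following is a minimal sketch for Python 3 (the variable names are only illustrative): it encodes a string to UTF-8 and compares what unicode_escape and latin-1 make of the same bytes.

```python
# Minimal check of the Latin-1 assumption (Python 3).
text = 'naïve'
utf8_bytes = text.encode('utf-8')  # 'ï' becomes the two bytes 0xC3 0xAF

# With no backslashes in the input, unicode_escape maps each byte
# straight to the code point with the same number, exactly as latin-1 does.
via_escape = utf8_bytes.decode('unicode_escape')
via_latin1 = utf8_bytes.decode('latin-1')

print(utf8_bytes)
print(via_escape == via_latin1)
```

Each UTF-8 byte of 'ï' is turned into its own character, which is where the two-character mojibake in the earlier output comes from.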
Adding a regular expression to solve the problem
(Surprisingly, we do not now have two problems.)
What we need to do is apply the unicode_escape decoder only to text that we are certain is ASCII. In particular, we can make sure to apply it only to valid Python escape sequences, which are guaranteed to be ASCII text.
The plan is to find escape sequences using a regular expression, and to pass a function as the replacement argument to re.sub so that each match is replaced with its unescaped value.
import re
import codecs

ESCAPE_SEQUENCE_RE = re.compile(r'''
    ( \\U........      # 8-digit hex escapes
    | \\u....          # 4-digit hex escapes
    | \\x..            # 2-digit hex escapes
    | \\[0-7]{1,3}     # Octal escapes
    | \\N\{[^}]+\}     # Unicode characters by name
    | \\[\\'"abfnrtv]  # Single-character escapes
    )''', re.UNICODE | re.VERBOSE)

def decode_escapes(s):
    def decode_match(match):
        # Each match is pure ASCII, so unicode-escape decodes it safely.
        return codecs.decode(match.group(0), 'unicode-escape')

    return ESCAPE_SEQUENCE_RE.sub(decode_match, s)
And with that:
>>> print(decode_escapes('Ernő \\t Rubik'))
Ernő    Rubik
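One caveat with the pattern above is that `.` matches any character, so malformed input such as `\xzz` is matched and then makes the codec raise. The following is a sketch of a stricter variant, not part of the original answer: the names STRICT_ESCAPE_RE and decode_escapes_strict are hypothetical, and leaving undecodable sequences untouched is an assumption about the desired behaviour.

```python
import codecs
import re

# Hypothetical stricter variant: hex escapes must contain real hex digits,
# so malformed input such as "\xzz" is simply not matched.
STRICT_ESCAPE_RE = re.compile(r'''
    ( \\U[0-9a-fA-F]{8}   # 8-digit hex escapes
    | \\u[0-9a-fA-F]{4}   # 4-digit hex escapes
    | \\x[0-9a-fA-F]{2}   # 2-digit hex escapes
    | \\[0-7]{1,3}        # Octal escapes
    | \\N\{[^}]+\}        # Unicode characters by name
    | \\[\\'"abfnrtv]     # Single-character escapes
    )''', re.VERBOSE)


def decode_escapes_strict(s):
    def decode_match(match):
        try:
            return codecs.decode(match.group(0), 'unicode-escape')
        except UnicodeDecodeError:
            # Assumption: sequences the codec rejects (for example an
            # unknown \N{...} name) are left in the output unchanged.
            return match.group(0)

    return STRICT_ESCAPE_RE.sub(decode_match, s)


print(decode_escapes_strict('Ernő \\t Rubik'))
print(decode_escapes_strict('bad \\xzz stays as it is'))
```

Whether a bad escape should pass through silently or raise depends on where the input comes from; removing the try/except restores the raising behaviour for the cases the regular expression still lets through.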
The actually correct and convenient answer for Python 3:
>>> import codecs
>>> myString = "spam\\neggs"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
spam
eggs
>>> myString = "naïve \\t test"
>>> print(codecs.escape_decode(bytes(myString, "utf-8"))[0].decode("utf-8"))
naïve   test
Details regarding codecs.escape_decode:
- codecs.escape_decode is a bytes-to-bytes decoder.
- codecs.escape_decode decodes ASCII escape sequences, such as b"\\n" -> b"\n" and b"\\xce" -> b"\xce".
- codecs.escape_decode does not care or need to know about the bytes object's encoding, but the encoding of the escaped bytes should match the encoding of the rest of the object.
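The last point can be illustrated with a small sketch. The helper name unescape_utf8 is hypothetical, and since codecs.escape_decode is undocumented its behaviour may differ between Python versions; this only shows the encoding-match caveat under that assumption.

```python
import codecs


def unescape_utf8(text):
    """Hypothetical helper: process byte-level escapes in UTF-8 text."""
    raw = codecs.escape_decode(text.encode("utf-8"))[0]
    return raw.decode("utf-8")


# The escaped bytes \xc3\xaf are the UTF-8 encoding of 'ï', so they
# match the encoding of the rest of the string and decode cleanly.
print(unescape_utf8("na\\xc3\\xafve \\t test"))

# \xef is 'ï' in Latin-1, not UTF-8, so the final decode step fails.
try:
    unescape_utf8("na\\xefve")
except UnicodeDecodeError as exc:
    print("mismatched encoding:", exc.reason)
```

Because the function works on bytes, escapes in the input have to be written as bytes of the encoding used for the final decode, here UTF-8.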
Background:
- @rspeer is correct: unicode_escape is the incorrect solution for Python 3. This is because unicode_escape decodes escaped bytes, then decodes bytes to a Unicode string, but receives no information regarding which codec to use for the second operation.
- @Jerub is correct: avoid the AST or eval.
- I first discovered codecs.escape_decode from this answer to "how do I .decode('string-escape') in Python3?". As that answer states, that function is currently not documented for Python 3.
The correct thing to do is to use the 'string_escape' codec (Python 2) or the 'unicode_escape' codec (Python 3) to decode the string.
>>> myString = "spam\\neggs"
>>> decoded_string = bytes(myString, "utf-8").decode("unicode_escape") # python3
>>> decoded_string = myString.decode('string_escape') # python2
>>> print(decoded_string)
spam
eggs
Don't use the AST or eval. Using the string codecs is much safer.