Unicode (UTF-8) reading and writing to files in Python
Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding.
Supposing the file is encoded in UTF-8, we can use:
>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")
Then f.read returns a decoded Unicode object:
>>> f.read()
u'Capit\xe1l\n\n'
In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (it does not in 2.x).
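As a rough sketch of the full round trip, the following writes and then reads a file through io.open, using with blocks so the file is closed automatically. The scratch path is an assumption made only for illustration.

```python
import io
import os
import tempfile

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "test")

# Writing: the file object encodes the text to UTF-8 on the way out.
with io.open(path, mode="w", encoding="utf-8") as f:
    f.write(u"Capit\xe1l\n\n")

# Reading: the file object decodes the UTF-8 bytes back to text.
with io.open(path, mode="r", encoding="utf-8") as f:
    text = f.read()

# On 3.x this is True, because io.open is the built-in open.
same_function = io.open is open
```

The same code should also run on 2.6 and 2.7, where same_function would be False.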
We can also use open from the codecs standard library module:
>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1l\n\n'
Note, however, that this can cause problems when mixing read() and readline().
Now all you need in Python 3 is open(Filename, 'r', encoding='utf-8').
[Edit on 2016-02-10 for requested clarification]
Python 3 added the encoding parameter to its open function. The following information about the open function is taken from the documentation: https://docs.python.org/3/library/functions.html#open
open(file, mode='r', buffering=-1,
encoding=None, errors=None, newline=None,
closefd=True, opener=None)
Encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.
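So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as UTF-8. (UTF-8 is also the default encoding for Python 3 source code, but as the documentation quoted above says, the default for open is platform dependent, so it is safer to pass the encoding explicitly.)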
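To make the effect of the encoding argument concrete, here is a minimal Python 3 sketch that writes text as UTF-8, looks at the raw bytes on disk, and then reads the file back with both the right and the wrong encoding. The file name is hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file for the demonstration.
path = os.path.join(tempfile.mkdtemp(), "capitan.txt")

# Text mode with an explicit encoding: str in, UTF-8 bytes on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write("Capit\xe1n")  # '\xe1' is the single character á

# Binary mode shows what was actually stored: á became two bytes.
with open(path, "rb") as f:
    raw = f.read()  # b'Capit\xc3\xa1n'

# Reading with the matching encoding recovers the original text.
with open(path, "r", encoding="utf-8") as f:
    decoded = f.read()

# Reading with a different encoding silently yields the wrong characters.
with open(path, "r", encoding="latin-1") as f:
    misread = f.read()  # 'CapitÃ¡n'
```

Relying on the platform default instead of passing encoding can produce the misread result on systems whose locale is not UTF-8.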
In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 through 3.2, where the u prefix is a syntax error), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.
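A short Python 3 check, offered as an illustration, shows that the escape denotes a single character and that only the UTF-8 encoding of that character takes two bytes:

```python
text = 'Capit\xe1n\n'

length = len(text)              # counts characters, the escape is just one
code_point = ord(text[5])       # the character written as \xe1
as_utf8 = text.encode('utf-8')  # the same text as UTF-8 bytes

print(length, hex(code_point), as_utf8)
```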
Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains the literal text \xc3\xa1 (backslash, x, c, 3, and so on) rather than the accented character. Those are 8 bytes and the code reads them all. We can see this by displaying the result:
# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'
# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:
# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán
The result is a str that is encoded in UTF-8, where the accented character is represented by the two bytes that were written as \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.
In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:
# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'
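The chain above can be easier to follow when split into named steps. The helper below is only a sketch: the name unescape_utf8 is made up for this example, and it assumes the input is pure ASCII and that every escape stands for one byte of UTF-8 data.

```python
def unescape_utf8(text):
    """Turn text holding literal \\xNN escapes of UTF-8 bytes into real text."""
    # str -> bytes, so that unicode_escape has something to decode.
    raw = text.encode('ascii')
    # Process the escapes; each \xNN becomes the character U+00NN.
    chars = raw.decode('unicode_escape')
    # Map each character U+00NN back to the single byte 0xNN.
    data = chars.encode('latin-1')
    # Finally interpret those bytes as UTF-8.
    return data.decode('utf-8')


result = unescape_utf8('Capit\\xc3\\xa1n\n')
print(result)
```

The ascii step raises UnicodeEncodeError if the input already contains non-ASCII characters, and the latin-1 step does the same if an escape produces a character above U+00FF, so this only suits input of the kind shown here.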