How to create a temporary file with Unicode encoding?
Everyone else's answers are correct, I just want to clarify what's going on:
The difference between the literal 'foo'
and the literal u'foo'
is that the former is a string of bytes and the latter is the Unicode object.
First, understand that Unicode is the character set. UTF-8 is the encoding. The Unicode object is the about the former—it's a Unicode string, not necessarily a UTF-8 one. In your case, the encoding for a string literal will be UTF-8, because you specified it in the first lines of the file.
To get a Unicode string from a byte string, you call the .encode()
method:
>>>> u"ひらがな".encode("utf-8") == "ひらがな"
True
Similarly, you could call your string.encode in the write
call and achieve the same effect as just removing the u
.
If you didn't specify the encoding in the top, say if you were reading the Unicode data from another file, you would specify what encoding it was in before it reached a Python string. This would determine how it would be represented in bytes (i.e., the str
type).
The error you're getting, then, is only because the tempfile
module is expecting a str
object. This doesn't mean it can't handle unicode, just that it expects you to pass in a byte string rather than a Unicode object—because without you specifying an encoding, it wouldn't know how to write it to the temp file.
tempfile.TemporaryFile has encoding option in Python 3:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile(mode='w+', encoding='utf-8') as fh:
fh.write("Hello World: ä")
fh.seek(0)
for line in fh:
print(line)
Note that now you need to specify mode='w+' instead of the default binary mode. Also note that string literals are implicitly Unicode in Python 3, there's no u modifier.
If you're stuck with Python 2.6, temporary files are always binary, and you need to encode the Unicode string before writing it to the file:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
fh.write(u"Hello World: ä".encode('utf-8'))
fh.seek(0)
for line in fh:
print line.decode('utf-8')
Unicode specifies the character set, not the encoding, so in either case you need a way to specify how to encode the Unicode characters!
Since I am working on a Python program with TemporaryFile objects that should run in both Python 2 and Python 3, I don't find it satisfactory to manually encode all strings written as UTF-8 like the other answers suggest.
Instead, I have written the following small polyfill (because I could not find something like it in six) to wrap a binary file-like object into a UTF-8 file-like object:
from __future__ import unicode_literals
import sys
import codecs
if sys.hexversion < 0x03000000:
def uwriter(fp):
return codecs.getwriter('utf-8')(fp)
else:
def uwriter(fp):
return fp
It is used in the following way:
# encoding: utf-8
from tempfile import NamedTemporaryFile
with uwriter(NamedTemporaryFile(suffix='.txt', mode='w')) as fp:
fp.write('Hællo wörld!\n')