How to create a temporary file with Unicode encoding?

Everyone else's answers are correct, I just want to clarify what's going on:

The difference between the literal 'foo' and the literal u'foo' is that the former is a string of bytes and the latter is the Unicode object.

First, understand that Unicode is the character set. UTF-8 is the encoding. The Unicode object is the about the former—it's a Unicode string, not necessarily a UTF-8 one. In your case, the encoding for a string literal will be UTF-8, because you specified it in the first lines of the file.

To get a Unicode string from a byte string, you call the .encode() method:

>>>> u"ひらがな".encode("utf-8") == "ひらがな"
True

Similarly, you could call your string.encode in the write call and achieve the same effect as just removing the u.

If you didn't specify the encoding in the top, say if you were reading the Unicode data from another file, you would specify what encoding it was in before it reached a Python string. This would determine how it would be represented in bytes (i.e., the str type).

The error you're getting, then, is only because the tempfile module is expecting a str object. This doesn't mean it can't handle unicode, just that it expects you to pass in a byte string rather than a Unicode object—because without you specifying an encoding, it wouldn't know how to write it to the temp file.

tempfile.TemporaryFile has encoding option in Python 3:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile(mode='w+', encoding='utf-8') as fh:
  fh.write("Hello World: ä")
  fh.seek(0)
  for line in fh:
    print(line)

Note that now you need to specify mode='w+' instead of the default binary mode. Also note that string literals are implicitly Unicode in Python 3, there's no u modifier.

If you're stuck with Python 2.6, temporary files are always binary, and you need to encode the Unicode string before writing it to the file:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import tempfile
with tempfile.TemporaryFile() as fh:
  fh.write(u"Hello World: ä".encode('utf-8'))
  fh.seek(0)
  for line in fh:
    print line.decode('utf-8')

Unicode specifies the character set, not the encoding, so in either case you need a way to specify how to encode the Unicode characters!

Since I am working on a Python program with TemporaryFile objects that should run in both Python 2 and Python 3, I don't find it satisfactory to manually encode all strings written as UTF-8 like the other answers suggest.

Instead, I have written the following small polyfill (because I could not find something like it in six) to wrap a binary file-like object into a UTF-8 file-like object:

from __future__ import unicode_literals
import sys
import codecs
if sys.hexversion < 0x03000000:
    def uwriter(fp):
        return codecs.getwriter('utf-8')(fp)
else:
    def uwriter(fp):
        return fp

It is used in the following way:

# encoding: utf-8
from tempfile import NamedTemporaryFile
with uwriter(NamedTemporaryFile(suffix='.txt', mode='w')) as fp:
    fp.write('Hællo wörld!\n')

How to create a temporary file with Unicode encoding?

Tags:

Python

Unicode

Temporary Files

Related

Recent Posts