Python CSV DictReader with UTF-8 data

A class-based approach to @LMatter's answer. With this approach you still get all the benefits of DictReader, such as access to the fieldnames and the line number, and it handles UTF-8:

import csv

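class UnicodeDictReader(csv.DictReader, object):
    # csv.DictReader is an old-style class in Python 2; adding object lets super() work

    def next(self):
        row = super(UnicodeDictReader, self).next()
        # decode every key and value from UTF-8 bytes to unicode
        return {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}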
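
For comparison, here is a minimal Python 3 sketch of the two DictReader features mentioned above, the fieldnames and the line number. In Python 3 the csv module reads text directly, so no decoding subclass is involved. The sample data and the use of io.StringIO in place of a real file are assumptions made for illustration.

```python
import csv
import io

# io.StringIO stands in for a file opened in text mode (hypothetical sample data).
data = io.StringIO("name,city\nZoë,Kraków\nJosé,Málaga\n")
reader = csv.DictReader(data)

for row in reader:
    # line_num counts the source lines read so far, header included
    print(reader.line_num, row["name"], row["city"])

# fieldnames holds the header row as a list of column names
print(reader.fieldnames)
```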

First of all, use the 2.6 version of the documentation; the docs can change with each release. They state clearly that the csv module doesn't support Unicode, but that it does support UTF-8. Technically, these are not the same thing. As the docs say:

The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
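
The point about NULs can be checked directly. The following is a small sketch, written for Python 3, that encodes one made-up CSV line both ways and looks for zero bytes; the sample line is an assumption chosen for illustration.

```python
# Compare how the same CSV line looks as raw bytes in two encodings.
line = "name,city\n"

utf8_bytes = line.encode("utf-8")
utf16_bytes = line.encode("utf-16-le")

# UTF-8 encodes each ASCII character as a single non-zero byte.
print(b"\x00" in utf8_bytes)

# UTF-16 uses two bytes per ASCII character, one of which is zero.
print(b"\x00" in utf16_bytes)
```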

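The example below (adapted from the docs) shows two functions that together read Unicode text as CSV by passing it through UTF-8. Note that csv.reader() returns a plain reader object whose rows are lists; the version here uses csv.DictReader instead, whose rows are dicts, so both keys and values have to be decoded.

import csv

def utf_8_encoder(unicode_csv_data):
    # encode each Unicode line as UTF-8 bytes for the csv module
    for line in unicode_csv_data:
        yield line.encode('utf-8')

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.DictReader(utf_8_encoder(unicode_csv_data),
                                dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, key by key and cell by cell:
        yield dict((unicode(key, 'utf-8'), unicode(value, 'utf-8'))
                   for key, value in row.iteritems())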

I came up with an answer myself:

import csv

def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        # decode both keys and values from UTF-8 bytes to unicode
        yield {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}

Note: This has been updated so keys are decoded, per the suggestion in the comments.

For me, the key was not in manipulating the csv DictReader args, but the file opener itself. This did the trick:

with open(filepath, mode="r", encoding="utf-8-sig") as csv_file:
    csv_reader = csv.DictReader(csv_file)

No special class required. Now I can open files either with or without BOM without crashing.
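
A short sketch of that behaviour, assuming Python 3: the same made-up CSV content is written once with a BOM and once without, then read back through the utf-8-sig codec. The helper read_rows and the sample data are hypothetical, and newline="" is added because the csv documentation recommends it when opening files for the module.

```python
import csv
import os
import tempfile

def read_rows(raw_bytes):
    """Write raw_bytes to a temporary file and read it back with DictReader."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw_bytes)
        # utf-8-sig drops a leading BOM if one is present and otherwise acts like utf-8
        with open(path, mode="r", encoding="utf-8-sig", newline="") as csv_file:
            return list(csv.DictReader(csv_file))
    finally:
        os.remove(path)

body = "name,city\nZoë,Kraków\n".encode("utf-8")
with_bom = read_rows(b"\xef\xbb\xbf" + body)   # file starting with a UTF-8 BOM
without_bom = read_rows(body)                  # same content, no BOM

print(with_bom == without_bom)
```

With plain "utf-8" as the encoding, the BOM would instead end up glued to the first header name, so the first key would no longer be "name".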
