Reading an Excel file is orders of magnitude slower with openpyxl than with xlrd

You can just iterate over the rows of the sheet:

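import openpyxl

def UseOpenpyxl(file_name):
    # read_only=True reads rows lazily instead of loading the whole workbook
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    # The first row holds the column names
    first_row = [cell.value for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            # 's' marks a string cell; strip surrounding whitespace
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                record[key] = cell.value
        data.append(record)
    return data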

This should scale to large files. You may want to chunk your result if the list data gets too large.
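One way to chunk the result is sketched below. The helper name iter_chunks and the sample records are assumptions made for illustration, not part of the original answer. It accepts any iterable of records and yields lists of at most chunk_size items, so only one chunk is held in memory at a time.

```python
from itertools import islice


def iter_chunks(records, chunk_size):
    """Yield lists of at most chunk_size items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(records)
    while True:
        # islice pulls the next chunk_size items without consuming more
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Stand-in for the dictionaries built from worksheet rows
sample_records = ({"id": i} for i in range(5))
chunks = list(iter_chunks(sample_records, 2))
```

To benefit from this, the reading function would have to yield each record instead of appending it to data. Chunking a list that is already fully built does not reduce peak memory.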

Now the openpyxl version takes about twice as long as the xlrd one:

%timeit xlrd_results = UseXlrd('foo.xlsx')
1 loops, best of 3: 3.38 s per loop

%timeit openpyxl_results = UseOpenpyxl('foo.xlsx')
1 loops, best of 3: 6.87 s per loop
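Outside IPython, a roughly comparable "best of 3" figure can be taken with the standard timeit module. This is a minimal sketch: the helper name best_of and the stand-in workload are assumptions, and the numbers it prints will not match the ones above.

```python
import timeit


def best_of(func, repeat=3, number=1):
    """Return the fastest per-call time in seconds over `repeat` runs."""
    timings = timeit.repeat(func, repeat=repeat, number=number)
    return min(timings) / number


# Stand-in workload; replace with e.g. lambda: UseOpenpyxl('foo.xlsx')
elapsed = best_of(lambda: sum(range(100_000)))
print(f"best of 3: {elapsed:.6f} s per loop")
```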

Note that xlrd and openpyxl may disagree on whether a number is an integer or a float. For my test data, I needed to add float() to make the outputs comparable:

def UseOpenpyxl(file_name):
    wb = openpyxl.load_workbook(file_name, read_only=True)
    sheet = wb.active
    rows = sheet.rows
    # float() on the header row only works if the column labels are numeric
    first_row = [float(cell.value) for cell in next(rows)]
    data = []
    for row in rows:
        record = {}
        for key, cell in zip(first_row, row):
            if cell.data_type == 's':
                record[key] = cell.value.strip()
            else:
                # Assumes every non-string cell holds a number; float(None)
                # raises TypeError for an empty cell
                record[key] = float(cell.value)
        data.append(record)
    return data

Now, both versions give the same results for my test data:

>>> xlrd_results == openpyxl_results
True
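The float() calls work for this test data, but they fail on empty cells and would also convert booleans. A more defensive variant might look like the sketch below. The helper normalize_value is hypothetical, not part of either library, and which types to coerce is an assumption you should adapt to your own data.

```python
from numbers import Real


def normalize_value(value):
    """Make a cell value comparable across readers (hypothetical helper)."""
    if isinstance(value, str):
        return value.strip()
    # bool is a subclass of int, so exclude it before the numeric check
    if isinstance(value, Real) and not isinstance(value, bool):
        return float(value)
    # None (empty cell), bool, dates and anything else pass through unchanged
    return value


normalized = [normalize_value(v) for v in [3, 3.0, "  abc ", None, True]]
```

Applying the same helper to the values from both readers keeps the comparison symmetric.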
