Reading Excel file is magnitudes slower using openpyxl compared to xlrd
You can just iterate over the sheet:
def UseOpenpyxl(file_name):
wb = openpyxl.load_workbook(file_name, read_only=True)
sheet = wb.active
rows = sheet.rows
first_row = [cell.value for cell in next(rows)]
data = []
for row in rows:
record = {}
for key, cell in zip(first_row, row):
if cell.data_type == 's':
record[key] = cell.value.strip()
else:
record[key] = cell.value
data.append(record)
return data
This should scale to large files. You may want to chunk your result if the list
data
gets too large.
Now the openpyxl version takes about twice as long as the xlrd one:
%timeit xlrd_results = UseXlrd('foo.xlsx')
1 loops, best of 3: 3.38 s per loop
%timeit openpyxl_results = UseOpenpyxl('foo.xlsx')
1 loops, best of 3: 6.87 s per loop
Note that xlrd and openpyxl might interpret what is an integer and what is a float slightly differently. For my test data, I needed to add float()
to make the outputs comparable:
def UseOpenpyxl(file_name):
wb = openpyxl.load_workbook(file_name, read_only=True)
sheet = wb.active
rows = sheet.rows
first_row = [float(cell.value) for cell in next(rows)]
data = []
for row in rows:
record = {}
for key, cell in zip(first_row, row):
if cell.data_type == 's':
record[key] = cell.value.strip()
else:
record[key] = float(cell.value)
data.append(record)
return data
Now, both versions give the same results for my test data:
>>> xlrd_results == openpyxl_results
True