Python pandas: how to specify data types when reading an Excel file?
You just specify converters. I created an excel spreadsheet of the following structure:
names ages
bob 05
tom 4
suzy 3
Where the "ages" column is formatted as strings. To load:
import pandas as pd
df = pd.read_excel('Book1.xlsx',sheetname='Sheet1',header=0,converters={'names':str,'ages':str})
>>> df
names ages
0 bob 05
1 tom 4
2 suzy 3
Starting with v0.20.0
, the dtype
keyword argument in read_excel()
function could be used to specify the data types that needs to be applied to the columns just like it exists for read_csv()
case.
Using converters
and dtype
arguments together on the same column name would lead to the latter getting shadowed and the former gaining preferance.
1) Inorder for it to not interpret the dtypes
but rather pass all the contents of it's columns as they were originally in the file before, we could set this arg to str
or object
so that we don't mess up our data. (one such case would be leading zeros in numbers which would be lost otherwise)
pd.read_excel('file_name.xlsx', dtype=str) # (or) dtype=object
2) It even supports a dict mapping wherein the keys
constitute the column names and values
it's respective data type to be set especially when you want to alter the dtype
for a subset of all the columns.
# Assuming data types for `a` and `b` columns to be altered
pd.read_excel('file_name.xlsx', dtype={'a': np.float64, 'b': np.int32})
The read_excel() function has a converters argument, where you can apply functions to input in certain columns. You can use this to keep them as strings. Documentation:
Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.
Example code:
pandas.read_excel(my_file, converters = {my_str_column: str})
In case if you are not aware of the number and name of columns in dataframe then this method can be handy:
column_list = []
df_column = pd.read_excel(file_name, 'Sheet1').columns
for i in df_column:
column_list.append(i)
converter = {col: str for col in column_list}
df_actual = pd.read_excel(file_name, converters=converter)
where column_list is the list of your column names.