Pandas read_csv ignoring column dtypes when I pass skip_footer arg
Unfortunately, using converters or newer pandas versions doesn't solve the more general problem of ensuring that read_csv never infers a float64 dtype. With pandas 0.15.2, the following example, using a CSV that contains integers in hexadecimal notation with NULL entries, shows that using converters for exactly what the name implies they should be used for interferes with the dtype specification.
In [1]: df = pd.DataFrame(dict(a = ["0xff", "0xfe"], b = ["0xfd", None], c = [None, "0xfc"], d = [None, None]))
In [2]: df.to_csv("H:/tmp.csv", index = False)
In [3]: ef = pd.read_csv("H:/tmp.csv", dtype = {c: object for c in "abcd"}, converters = {c: lambda x: None if x == "" else int(x, 16) for c in "abcd"})
In [4]: ef.dtypes.map(lambda x: x)
Out[4]:
a int64
b float64
c float64
d object
dtype: object
The specified dtype of object is only respected for the all-NULL column. In this case the float64 values can simply be converted back to integers, but in general that conversion is lossy: by the pigeonhole principle, not every 64-bit integer can be represented exactly as a float64.
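The precision loss is easy to demonstrate: a float64 has a 53-bit significand, so integers above 2**53 can no longer all be represented exactly. A minimal check in plain Python:

```python
# float64 is exact up to 2**53, but 2**53 + 1 cannot be represented:
# it rounds back down to 2**53 when converted to float.
n = 2**53 + 1
assert float(2**53) == 2**53      # still exact at the boundary
assert int(float(n)) != n         # 2**53 + 1 rounds to 2**53
```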
The best solution I have found for this more general case is to have pandas read the potentially problematic columns as strings, as already covered, and then convert only the slice of values that actually needs conversion (rather than mapping the conversion over the whole column, which again triggers an automatic float64 inference).
In [5]: ff = pd.read_csv("H:/tmp.csv", dtype = {c: object for c in "bc"}, converters = {c: lambda x: None if x == "" else int(x, 16) for c in "ad"})
In [6]: ff.dtypes
Out[6]:
a int64
b object
c object
d object
dtype: object
In [7]: for c in "bc":
.....: ff.loc[~pd.isnull(ff[c]), c] = ff[c][~pd.isnull(ff[c])].map(lambda x: int(x, 16))
.....:
In [8]: ff.dtypes
Out[8]:
a int64
b object
c object
d object
dtype: object
In [9]: [(ff[c][i], type(ff[c][i])) for c in ff.columns for i in ff.index]
Out[9]:
[(255, numpy.int64),
(254, numpy.int64),
(253L, long),
(nan, float),
(nan, float),
(252L, long),
(None, NoneType),
(None, NoneType)]
As far as I have been able to determine, at least up to version 0.15.2 there is no way to avoid postprocessing of string values in situations like this.
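For readers without the H:/tmp.csv file, the read-as-string-then-convert-the-slice workflow above can be reproduced with an in-memory CSV; the data here is illustrative, with io.StringIO standing in for the file:

```python
import io
import pandas as pd

# Two hex columns, one with a NULL entry.
csv = io.StringIO("a,b\n0xff,0xfd\n0xfe,\n")

# Read the problematic column as strings first...
df = pd.read_csv(csv, dtype={"b": object})

# ...then convert only the non-null slice, so no float64 is inferred.
mask = ~pd.isnull(df["b"])
df.loc[mask, "b"] = df.loc[mask, "b"].map(lambda x: int(x, 16))

print(df.dtypes)  # column b stays object
```

The key point is that the assignment targets only the non-null rows, so the object dtype of the column is never replaced by an inferred float64.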
Pandas 0.13.1 silently ignored the dtype argument because the c engine does not support skip_footer. This caused pandas to fall back to the python engine, which does not support dtype.
Solution? Use converters
df = pd.read_csv('SomeFile.csv',
header=1,
skip_footer=1,
usecols=[2, 3],
converters={'CUSTOMER': str, 'ORDER NO': str},
engine='python')
Output:
In [1]: df.dtypes
Out[1]:
CUSTOMER object
ORDER NO object
dtype: object
In [3]: type(df['CUSTOMER'][0])
Out[3]: str
In [5]: df.head()
Out[5]:
CUSTOMER ORDER NO
0 03106 253734
1 03156 290550
2 03175 262207
3 03175 262207
4 03175 262207
Leading 0's from the original file are preserved and all data is stored as strings.
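The same behavior can be checked without SomeFile.csv; this sketch uses an in-memory CSV with made-up rows (column names mirror the example above, and header/skip_footer/usecols are omitted for brevity):

```python
import io
import pandas as pd

csv = io.StringIO(
    "CUSTOMER,ORDER NO\n"
    "03106,253734\n"
    "03156,290550\n"
)

# converters run before any dtype inference, so the values
# arrive as strings and the leading zeros survive.
df = pd.read_csv(
    csv,
    converters={"CUSTOMER": str, "ORDER NO": str},
    engine="python",  # converters also work on the python engine
)

print(df.dtypes)               # both columns come back as object
print(df["CUSTOMER"].iloc[0])  # leading zero preserved
```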