Pandas read_csv ignoring column dtypes when I pass skip_footer arg
Unfortunately, using converters or newer pandas versions doesn't solve the more general problem of ensuring that read_csv never infers a float64 dtype. With pandas 0.15.2, the following example, using a CSV that contains integers in hexadecimal notation with NULL entries, shows that using converters for exactly what the name implies they should be used for interferes with the dtype specification.
In [1]: df = pd.DataFrame(dict(a = ["0xff", "0xfe"], b = ["0xfd", None], c = [None, "0xfc"], d = [None, None]))
In [2]: df.to_csv("H:/tmp.csv", index = False)
In [3]: ef = pd.read_csv("H:/tmp.csv", dtype = {c: object for c in "abcd"}, converters = {c: lambda x: None if x == "" else int(x, 16) for c in "abcd"})
In [4]: ef.dtypes.map(lambda x: x)
Out[4]:
a int64
b float64
c float64
d object
dtype: object
The specified dtype of object is only respected for the all-NULL column. In this case the float64 values can simply be converted back to integers, but in general that conversion is lossy: by the pigeonhole principle, not every 64-bit integer can be represented exactly as a float64.
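The precision loss is easy to demonstrate: a float64 has a 53-bit significand, so integers above 2**53 can no longer all be represented exactly. A minimal check in plain Python:

```python
# float64 is exact up to 2**53, but 2**53 + 1 cannot be represented:
# it rounds back down to 2**53 when converted to float.
n = 2**53 + 1
assert float(2**53) == 2**53      # still exact at the boundary
assert int(float(n)) != n         # 2**53 + 1 rounds to 2**53
```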
The best solution I have found for this more general case is to have pandas read the potentially problematic columns as strings, as already covered, and then convert only the slice of values that actually needs conversion (rather than mapping the conversion over the whole column, which again triggers an automatic float64 inference).
In [5]: ff = pd.read_csv("H:/tmp.csv", dtype = {c: object for c in "bc"}, converters = {c: lambda x: None if x == "" else int(x, 16) for c in "ad"})
In [6]: ff.dtypes
Out[6]:
a int64
b object
c object
d object
dtype: object
In [7]: for c in "bc":
.....: ff.loc[~pd.isnull(ff[c]), c] = ff[c][~pd.isnull(ff[c])].map(lambda x: int(x, 16))
.....:
In [8]: ff.dtypes
Out[8]:
a int64
b object
c object
d object
dtype: object
In [9]: [(ff[c][i], type(ff[c][i])) for c in ff.columns for i in ff.index]
Out[9]:
[(255, numpy.int64),
(254, numpy.int64),
(253L, long),
(nan, float),
(nan, float),
(252L, long),
(None, NoneType),
(None, NoneType)]
As far as I have been able to determine, at least up to version 0.15.2 there is no way to avoid postprocessing of string values in situations like this.
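For readers without the H:/tmp.csv file, the read-as-string-then-convert-the-slice workflow above can be reproduced with an in-memory CSV; the data here is illustrative, with io.StringIO standing in for the file:

```python
import io
import pandas as pd

# Two hex columns, one with a NULL entry.
csv = io.StringIO("a,b\n0xff,0xfd\n0xfe,\n")

# Read the problematic column as strings first...
df = pd.read_csv(csv, dtype={"b": object})

# ...then convert only the non-null slice, so no float64 is inferred.
mask = ~pd.isnull(df["b"])
df.loc[mask, "b"] = df.loc[mask, "b"].map(lambda x: int(x, 16))

print(df.dtypes)  # column b stays object
```

The key point is that the assignment targets only the non-null rows, so the object dtype of the column is never replaced by an inferred float64.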
Pandas 0.13.1 silently ignored the dtype argument because the c engine does not support skip_footer. This caused pandas to fall back to the python engine, which does not support dtype.
Solution? Use converters
df = pd.read_csv('SomeFile.csv',
header=1,
skip_footer=1,
usecols=[2, 3],
converters={'CUSTOMER': str, 'ORDER NO': str},
engine='python')
Output:
In [1]: df.dtypes
Out[1]:
CUSTOMER object
ORDER NO object
dtype: object
In [3]: type(df['CUSTOMER'][0])
Out[3]: str
In [5]: df.head()
Out[5]:
CUSTOMER ORDER NO
0 03106 253734
1 03156 290550
2 03175 262207
3 03175 262207
4 03175 262207
Leading 0's from the original file are preserved and all data is stored as strings.
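The same behavior can be checked without SomeFile.csv; this sketch uses an in-memory CSV with made-up rows (column names mirror the example above, and header/skip_footer/usecols are omitted for brevity):

```python
import io
import pandas as pd

csv = io.StringIO(
    "CUSTOMER,ORDER NO\n"
    "03106,253734\n"
    "03156,290550\n"
)

# converters run before any dtype inference, so the values
# arrive as strings and the leading zeros survive.
df = pd.read_csv(
    csv,
    converters={"CUSTOMER": str, "ORDER NO": str},
    engine="python",  # converters also work on the python engine
)

print(df.dtypes)               # both columns come back as object
print(df["CUSTOMER"].iloc[0])  # leading zero preserved
```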