Applying function to columns of a Pandas DataFrame, conditional on data type
This comment is correct: this behaviour is by design. When applied across mixed columns, Pandas upcasts them to the type that is highest up in the type hierarchy among all the dtypes given.
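A minimal sketch of that upcasting (the DataFrame below is hypothetical, constructed to match the column dtypes implied by the outputs later in this answer):

```python
import pandas as pd

# Hypothetical DataFrame with the mixed dtypes discussed in the question
df = pd.DataFrame({
    'A': [1, 2],            # int64
    'B': [1.0, 2.5],        # float64
    'C': ['x', 'y'],        # object (string)
    'D': [True, False],     # bool
})

# The common NumPy type across these mixed columns is object
print(df.to_numpy().dtype)  # object
```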
Consider applying the function to only "A",
df[['A']].apply(dtype_fn)

int64
A    int64
dtype: object
And similarly, with only "A" and "B",
df[['A', 'B']].apply(dtype_fn)

float64
float64
A    float64
B    float64
dtype: object
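The int64/float64 upcast can be verified directly by looking at the NumPy array backing the column selection (again using a hypothetical two-column frame matching the example):

```python
import pandas as pd

# Hypothetical columns matching the example: A is int64, B is float64
df = pd.DataFrame({'A': [1, 2], 'B': [1.0, 2.5]})

# int64 and float64 share float64 as their common type
print(df[['A', 'B']].to_numpy().dtype)  # float64

# A single int64 column needs no upcast
print(df[['A']].to_numpy().dtype)       # int64
```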
Since you have multiple types, including string, in your original DataFrame, the common type for them all is object.
Now, this explains the behaviour, but we still need a fix. Pandas offers a useful method, Series.infer_objects, which infers the dtype and performs a "soft conversion".
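A quick illustration of that soft conversion on its own: an object-dtype Series whose values are really integers is converted back to int64.

```python
import pandas as pd

# An object-dtype Series whose underlying values are plain Python ints
s = pd.Series([1, 2, 3], dtype=object)

print(s.dtype)                  # object
print(s.infer_objects().dtype)  # int64
```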
If you really need the type inside the function, you can perform this soft cast before reading dtype. This produces the expected result:
def dtype_fn(the_col):
    the_col = the_col.infer_objects()
    print(the_col.dtype)
    return the_col.dtype
df.apply(dtype_fn)

int64
float64
object
bool
A      int64
B    float64
C     object
D       bool
dtype: object
The actual input to your dtype_fn is a Pandas Series object. You can access the underlying NumPy dtype by modifying your function slightly:
def dtype_fn(the_col):
    print(the_col.values.dtype)
    return the_col.values.dtype
For more information about why this is the case, have a look at this answer, which says:

"This is not an error but is due to the numpy dtype representation: https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html."