Applying function to columns of a Pandas DataFrame, conditional on data type
This comment is correct: this behaviour is by design. When applied across mixed columns, Pandas upcasts them to the type that is highest up in the type hierarchy among all the dtypes given.
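A minimal sketch of that upcasting (the DataFrame below is hypothetical, constructed to match the column dtypes implied by the outputs later in this answer):

```python
import pandas as pd

# Hypothetical DataFrame with the mixed dtypes discussed in the question
df = pd.DataFrame({
    'A': [1, 2],            # int64
    'B': [1.0, 2.5],        # float64
    'C': ['x', 'y'],        # object (string)
    'D': [True, False],     # bool
})

# The common NumPy type across these mixed columns is object
print(df.to_numpy().dtype)  # object
```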
Consider applying the function to only "A",
df[['A']].apply(dtype_fn)

int64
A    int64
dtype: object
And similarly, with only "A" and "B",
df[['A', 'B']].apply(dtype_fn)

float64
float64
A    float64
B    float64
dtype: object
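The int64/float64 upcast can be verified directly by looking at the NumPy array backing the column selection (again using a hypothetical two-column frame matching the example):

```python
import pandas as pd

# Hypothetical columns matching the example: A is int64, B is float64
df = pd.DataFrame({'A': [1, 2], 'B': [1.0, 2.5]})

# int64 and float64 share float64 as their common type
print(df[['A', 'B']].to_numpy().dtype)  # float64

# A single int64 column needs no upcast
print(df[['A']].to_numpy().dtype)       # int64
```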
Since you have multiple types, including string, in your original DataFrame, the common type for them all is object.
Now, this explains the behaviour, but we still need a fix. Pandas offers a useful method, Series.infer_objects, which infers the dtype and performs a "soft conversion".
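A quick illustration of that soft conversion on its own: an object-dtype Series whose values are really integers is converted back to int64.

```python
import pandas as pd

# An object-dtype Series whose underlying values are plain Python ints
s = pd.Series([1, 2, 3], dtype=object)

print(s.dtype)                  # object
print(s.infer_objects().dtype)  # int64
```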
If you really need the type inside the function, you can perform this soft cast before reading dtype. This produces the expected result:
def dtype_fn(the_col):
    the_col = the_col.infer_objects()
    print(the_col.dtype)
    return the_col.dtype
df.apply(dtype_fn)

int64
float64
object
bool
A      int64
B    float64
C     object
D       bool
dtype: object
The actual input to your dtype_fn is a Pandas Series object. You can access the underlying NumPy dtype by modifying your function slightly:
def dtype_fn(the_col):
    print(the_col.values.dtype)
    return the_col.values.dtype
For more information about why this is the case, have a look at this answer, which says:

"This is not an error but is due to the numpy dtype representation: https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html."