How to check the dtype of a column in Python pandas
You can access the data type of a column with dtype:
for y in agg.columns:
    if agg[y].dtype == np.float64 or agg[y].dtype == np.int64:
        treat_numeric(agg[y])
    else:
        treat_str(agg[y])
In pandas 0.20.2 you can do:
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
is_string_dtype(df['A'])
>>> True
is_numeric_dtype(df['B'])
>>> True
So your code becomes:
for y in agg.columns:
    if is_string_dtype(agg[y]):
        treat_str(agg[y])
    elif is_numeric_dtype(agg[y]):
        treat_numeric(agg[y])
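Note that the if/elif pair above silently skips columns that are neither string nor numeric. A minimal sketch of the same dispatch with an explicit fallback group might look like this; the helper name split_columns_by_kind and the sample frame are hypothetical, not part of the original answer:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype


def split_columns_by_kind(frame):
    """Group column names into numeric, string, and everything else."""
    groups = {"numeric": [], "string": [], "other": []}
    for name in frame.columns:
        column = frame[name]
        if is_numeric_dtype(column):
            groups["numeric"].append(name)
        elif is_string_dtype(column):
            groups["string"].append(name)
        else:
            # Columns such as datetimes match neither check.
            groups["other"].append(name)
    return groups


# Hypothetical sample data, used only for illustration.
agg = pd.DataFrame({
    "price": [1.5, 2.5],
    "label": ["a", "b"],
    "when": pd.to_datetime(["2013-01-02", "2016-10-20"]),
})
groups = split_columns_by_kind(agg)
print(groups)
```

Collecting names first keeps the dtype checks separate from whatever treat_str and treat_numeric actually do.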
The title of the question is general, but the author's use case stated in the body of the question is specific, so any of the other answers may be used for that case.
But to fully answer the title question, it should be clarified that all of the approaches may fail in some cases and require some rework. I reviewed all of them (and some additional ones), ordered from least to most reliable (in my opinion):
1. Comparing types directly via == (accepted answer).
Although this is the accepted answer and has the most upvotes, I think this method should not be used at all, because this approach is discouraged in Python, as mentioned several times here.
But if one still wants to use it, one should be aware of some pandas-specific dtypes like pd.CategoricalDtype, pd.PeriodDtype, or pd.IntervalDtype. Here one has to use an extra type( ) call in order to recognize the dtype correctly:
s = pd.Series([pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')])
s
s.dtype == pd.PeriodDtype # Not working
type(s.dtype) == pd.PeriodDtype # working
>>> 0 2002-03-01
>>> 1 2012-02-01
>>> dtype: period[D]
>>> False
>>> True
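As an alternative to wrapping the dtype in type( ), one can test the dtype object itself with isinstance. This is only a sketch with made-up sample Series, and it checks the dtype, not the values stored in the column:

```python
import pandas as pd

# Hypothetical sample data: one period column and one categorical column.
periods = pd.Series(pd.period_range("2002-03-01", periods=2, freq="D"))
categories = pd.Series([1, -1]).astype("category")

# isinstance is applied to the dtype object, not to the Series.
is_period = isinstance(periods.dtype, pd.PeriodDtype)
is_category = isinstance(categories.dtype, pd.CategoricalDtype)
category_is_period = isinstance(categories.dtype, pd.PeriodDtype)
print(is_period, is_category, category_is_period)
```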
Another caveat here is that the type must be specified precisely:
s = pd.Series([1,2])
s
s.dtype == np.int64 # Working
s.dtype == np.int32 # Not working
>>> 0 1
>>> 1 2
>>> dtype: int64
>>> True
>>> False
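If the exact width does not matter, a width-agnostic check avoids this trap. The sketch below uses pd.api.types.is_integer_dtype on illustrative Series; the variable names are arbitrary:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_integer_dtype

s64 = pd.Series([1, 2], dtype=np.int64)
s32 = pd.Series([1, 2], dtype=np.int32)
s_float = pd.Series([1.0, 2.0])

# Both integer widths pass; the float Series does not.
results = [is_integer_dtype(s) for s in (s64, s32, s_float)]
print(results)
```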
2. isinstance() approach.
This method has not been mentioned in the answers so far.
So if comparing types directly is not a good idea, let's try the built-in Python function for this purpose, namely isinstance().
It fails right at the start, because it assumes that we have some objects, but a pd.Series or pd.DataFrame may be used as just an empty container with a predefined dtype but no objects in it:
s = pd.Series([], dtype=bool)
s
>>> Series([], dtype: bool)
But if one somehow overcomes this issue and wants to access each object, for example in the first row, and check its type like this:
df = pd.DataFrame({'int': [12, 2], 'dt': [pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')]},
                  index = ['A', 'B'])
for col in df.columns:
    print((df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)))
>>> (dtype('int64'), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')
This will be misleading in the case of mixed types of data in a single column:
df2 = pd.DataFrame({'data': [12, pd.Timestamp('2013-01-02')]},
                   index = ['A', 'B'])
for col in df2.columns:
    print((df2[col].dtype, 'is_int64 = %s' % isinstance(df2.loc['A', col], np.int64)))
>>> (dtype('O'), 'is_int64 = False')
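Checking only the first row hides the mix. One way to make it visible, sketched here with a hypothetical helper value_types, is to collect the Python types of all values in the column:

```python
import pandas as pd


def value_types(column):
    """Return the set of Python types found among the column's values."""
    return {type(value) for value in column}


# Hypothetical sample data mirroring the mixed column above.
mixed = pd.Series([12, pd.Timestamp("2013-01-02")], index=["A", "B"])
uniform = pd.Series(["s1", "s2"])

mixed_types = value_types(mixed)
uniform_types = value_types(uniform)
print(mixed.dtype, len(mixed_types), len(uniform_types))
```

This scans every value, so it is slower than a dtype check and returns an empty set for an empty Series.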
And last but not least, this method cannot directly recognize the Category dtype. As stated in the docs:
Returning a single item from categorical data will also return the value, not a categorical of length “1”.
df['int'] = df['int'].astype('category')
for col in df.columns:
    print((df[col].dtype, 'is_int64 = %s' % isinstance(df.loc['A', col], np.int64)))
>>> (CategoricalDtype(categories=[2, 12], ordered=False), 'is_int64 = True')
>>> (dtype('<M8[ns]'), 'is_int64 = False')
So this method is also almost inapplicable.
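For categorical columns specifically, the dtype of the underlying categories can be read from the .cat accessor instead of from individual values. A small sketch with an illustrative Series:

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Hypothetical sample data: an integer column converted to category.
s = pd.Series([12, 2], index=["A", "B"]).astype("category")

is_categorical = isinstance(s.dtype, pd.CategoricalDtype)
# The categories are stored in an Index that carries its own dtype.
categories_dtype = s.cat.categories.dtype
print(is_categorical, categories_dtype)
```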
3. df.dtype.kind approach.
This method may still work with an empty pd.Series or pd.DataFrame, but it has other problems.
First, it is unable to distinguish some dtypes:
df = pd.DataFrame({'prd' :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                   'str' :['s1', 's2'],
                   'cat' :[1, -1]})
df['cat'] = df['cat'].astype('category')
for col in df:
    # kind will define all columns as 'Object'
    print (df[col].dtype, df[col].dtype.kind)
>>> period[D] O
>>> object O
>>> category O
Second, and this is still unclear to me, it even returns None for some dtypes.
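For plain numpy dtypes the kind codes are distinct, which is where this approach is still usable. The sketch below uses an illustrative frame; the categorical column is included to show that it collapses to 'O':

```python
import pandas as pd

# Hypothetical sample data covering a few common dtypes.
frame = pd.DataFrame({
    "i": pd.Series([1, 2], dtype="int64"),
    "f": [1.5, 2.5],
    "b": [True, False],
    "dt": pd.to_datetime(["2013-01-02", "2016-10-20"]),
    "cat": pd.Series([1, -1]).astype("category"),
})

# One-letter kind code per column.
kinds = {name: frame[name].dtype.kind for name in frame.columns}
print(kinds)
```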
4. df.select_dtypes approach.
This is almost what we want. This method is designed inside pandas, so it handles most of the corner cases mentioned earlier: it works on empty DataFrames and tells numpy dtypes from pandas-specific ones. It works well with a single dtype, like .select_dtypes('bool'). It may even be used for selecting groups of columns based on dtype:
test = pd.DataFrame({'bool' :[False, True], 'int64':[-1,2], 'int32':[-1,2],'float': [-2.5, 3.4],
                     'compl':np.array([1-1j, 5]),
                     'dt' :[pd.Timestamp('2013-01-02'), pd.Timestamp('2016-10-20')],
                     'td' :[pd.Timestamp('2012-03-02')- pd.Timestamp('2016-10-20'),
                            pd.Timestamp('2010-07-12')- pd.Timestamp('2000-11-10')],
                     'prd' :[pd.Period('2002-03','D'), pd.Period('2012-02-01', 'D')],
                     'intrv':pd.arrays.IntervalArray([pd.Interval(0, 0.1), pd.Interval(1, 5)]),
                     'str' :['s1', 's2'],
                     'cat' :[1, -1],
                     'obj' :[[1,2,3], [5435,35,-52,14]]
                    })
test['int32'] = test['int32'].astype(np.int32)
test['cat'] = test['cat'].astype('category')
Like so, as stated in the docs:
test.select_dtypes('number')
>>> int64 int32 float compl td
>>> 0 -1 -1 -2.5 (1-1j) -1693 days
>>> 1 2 2 3.4 (5+0j) 3531 days
One may think that here we see the first unexpected results (as it used to be for me: question): TimeDelta is included in the output DataFrame. But, as answered to the contrary, it should be so; one just has to be aware of it. Note that the bool dtype is skipped, which may also be undesired for someone, but that is because bool and number are in different "subtrees" of numpy dtypes. In the case of bool, we may use test.select_dtypes(['bool']) here.
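Both surprises can be handled in the call itself by combining include and exclude. This is a sketch on a reduced, illustrative version of the test frame, not the full one above:

```python
import pandas as pd

# Hypothetical reduced frame with the dtypes relevant to this caveat.
test = pd.DataFrame({
    "bool": [False, True],
    "int64": [-1, 2],
    "float": [-2.5, 3.4],
    "td": pd.to_timedelta(["1 days", "2 days"]),
    "str": ["s1", "s2"],
})

# Numbers without timedeltas.
plain_numbers = test.select_dtypes(include="number", exclude="timedelta")
# Numbers plus booleans, still without timedeltas.
numbers_and_bools = test.select_dtypes(include=["number", "bool"],
                                       exclude=["timedelta"])
print(list(plain_numbers.columns), list(numbers_and_bools.columns))
```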
The next restriction of this method is that for the current version of pandas (0.24.2), this code: test.select_dtypes('period') will raise NotImplementedError.
Another thing is that it is unable to distinguish strings from other objects:
test.select_dtypes('object')
>>> str obj
>>> 0 s1 [1, 2, 3]
>>> 1 s2 [5435, 35, -52, 14]
But this is, first, already mentioned in the docs. And second, it is not a problem of this method but rather of the way strings are stored in a DataFrame. In any case, this situation requires some post-processing.
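One possible form of that post-processing, shown only as a sketch, is to run pd.api.types.infer_dtype over the object columns and keep those whose values are all strings. The frame below is illustrative, and the columns are forced to object dtype so the example does not depend on how a given pandas version stores strings:

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical sample data with explicit object dtype.
test = pd.DataFrame({
    "str": pd.Series(["s1", "s2"], dtype=object),
    "obj": pd.Series([[1, 2, 3], [5435, 35, -52, 14]], dtype=object),
    "int": [1, 2],
})

object_columns = test.select_dtypes(include="object")
# infer_dtype inspects the values, so it can tell strings from lists.
string_columns = [
    name for name in object_columns.columns
    if infer_dtype(object_columns[name]) == "string"
]
print(string_columns)
```

Because infer_dtype looks at the values rather than the declared dtype, it costs more than a dtype check on large columns.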
5. pd.api.types.is_XXX_dtype approach.
This one is intended to be the most robust and native way to achieve dtype recognition, I suppose (the path of the module where the functions reside speaks for itself). It works almost perfectly, but it still has at least one caveat, and one still has to distinguish string columns somehow.
Besides, and this may be subjective, this approach also handles the number dtypes group in a more 'human-understandable' way compared with .select_dtypes('number'):
for col in test.columns:
    if pd.api.types.is_numeric_dtype(test[col]):
        print (test[col].dtype)
>>> bool
>>> int64
>>> int32
>>> float64
>>> complex128
No timedelta, and bool is included. Perfect.
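If booleans should not count as numbers in a particular pipeline, the two checks can be combined. The helper is_strictly_numeric below is a hypothetical name for this sketch, and the frame is illustrative:

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def is_strictly_numeric(column):
    """True for numeric columns, excluding booleans."""
    return is_numeric_dtype(column) and not is_bool_dtype(column)


# Hypothetical sample data.
frame = pd.DataFrame({
    "flag": [True, False],
    "count": [1, 2],
    "ratio": [0.5, 1.5],
    "td": pd.to_timedelta(["1 days", "2 days"]),
})

loose = [name for name in frame.columns if is_numeric_dtype(frame[name])]
strict = [name for name in frame.columns if is_strictly_numeric(frame[name])]
print(loose, strict)
```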
My pipeline relies on exactly this functionality at the moment, plus a bit of manual post-processing.
Output.
I hope I was able to argue the main point: all of the discussed approaches may be used, but only pd.DataFrame.select_dtypes() and pd.api.types.is_XXX_dtype should really be considered applicable.
I know this is a bit of an old thread, but with pandas 0.19.2 you can do:
df.select_dtypes(include=['float64']).apply(your_function)
df.select_dtypes(exclude=['string','object']).apply(your_other_function)
http://pandas.pydata.org/pandas-docs/version/0.19.2/generated/pandas.DataFrame.select_dtypes.html
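As a concrete illustration of the select_dtypes plus apply pattern, here is a small sketch. The frame and the lambdas stand in for your own data and functions, and the second call selects by 'number' rather than by exclusion to keep the example independent of how strings are stored:

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"a": [1.0, 3.0], "b": [1, 2], "c": ["x", "y"]})

# Apply a function to float64 columns only.
float_means = df.select_dtypes(include=["float64"]).apply(lambda col: col.mean())
# Apply another function to all numeric columns.
number_max = df.select_dtypes(include=["number"]).apply(lambda col: col.max())
print(float_means)
print(number_max)
```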