How could I detect subtypes in pandas object columns?
You should appreciate that with Pandas you can have 2 broad types of series:
- Optimised structures: Usually numeric data; this includes np.datetime64 and bool.
- object dtype: Used for series with mixed types or types which cannot be held natively in a NumPy array. The series is structured as a sequence of pointers to arbitrary Python objects and is generally inefficient.
The reason for this preamble is that you should only ever need to apply element-wise logic to the second type. Data in the first category is homogeneous by nature.
So you should separate your logic accordingly.
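For illustration, assume a small DataFrame like the one below (the same frame is built in the later answers; the column names and values are just an example):
import datetime
import pandas as pd

# 'a' and 'b' are homogeneous numeric columns; 'c' mixes a
# datetime.time object with plain numbers, so it becomes object dtype
df = pd.DataFrame({'a': [100, 3, 4],
                   'b': [20.1, 2.3, 45.3],
                   'c': [datetime.time(23, 52), 30, 1.00]})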
Regular dtypes
Use pd.DataFrame.dtypes:
print(df.dtypes)
a int64
b float64
c object
dtype: object
object dtype
Isolate these series via pd.DataFrame.select_dtypes and then use a dictionary comprehension:
obj_types = {col: set(map(type, df[col])) for col in df.select_dtypes(include=[object])}
print(obj_types)
{'c': {int, datetime.time, float}}
You will need to do a little more work to get the exact format you require, but the above should be your plan of attack.
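For example, one sketch of that extra work, turning the type objects in obj_types into plain names, could look like this:
readable = {col: sorted(t.__name__ for t in kinds) for col, kinds in obj_types.items()}
print(readable)
# {'c': ['float', 'int', 'time']}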
Just wanted to provide what I found to be a more readable version...
Load your packages and create the dataframe
# Packages
import pandas as pd
import datetime
# DataFrame
df = pd.DataFrame({'a': [100, 3, 4], 'b': [20.1, 2.3, 45.3], 'c': [datetime.time(23, 52), 30, 1.00]})
# Map over each column individually, within a print
print("column a =", df.a.map(type).unique())
print("column b =", df.b.map(type).unique())
print("column c =", df.c.map(type).unique())
# Outputs:
column a = [<class 'int'>]
column b = [<class 'float'>]
column c = [<class 'datetime.time'> <class 'int'> <class 'float'>]
Likely unnecessary (and a bit more complicated), but the following would help you remove the class and < > characters...
# Use `.__name__` within a list comprehension to access only the type name
print("column a =", [x.__name__ for x in df.a.map(type).unique()])
print("column b =", [x.__name__ for x in df.b.map(type).unique()])
print("column c =", [x.__name__ for x in df.c.map(type).unique()])
# Outputs:
column a = ['int']
column b = ['float']
column c = ['time', 'int', 'float']
While this is repetitive, and I know that repetition in code is often frowned upon, it is (at least to me) much simpler to understand if you were sharing this code with someone else, and thus more valuable (again, in my opinion).
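That said, if you do want to avoid the repetition, the same output can be produced with a simple loop over the columns (just a sketch, assuming the df built above):
# Same idea as the repeated prints, looped over the columns
for col in df.columns:
    print("column", col, "=", [t.__name__ for t in df[col].map(type).unique()])
# column a = ['int']
# column b = ['float']
# column c = ['time', 'int', 'float']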
You can just use the Python built-in function map.
column_c = list(map(type, df['c']))
print(column_c)
output:
[datetime.time, int, float]
types = {i: set(map(type, df[i])) for i in df.columns}
# this returns the unique element types of every column in a dict
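As a quick usage sketch with the example df from above, you can then pull readable names out of that dict, e.g. for the mixed column c:
print(sorted(t.__name__ for t in types['c']))
# ['float', 'int', 'time']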