How to give column names after one-hot encoding with sklearn?
This example may help future readers:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
train_X = pd.DataFrame({'Sex':['male', 'female']*3, 'AgeGroup':[0,15,30,45,60,75]})
>>>
      Sex  AgeGroup
0    male         0
1  female        15
2    male        30
3  female        45
4    male        60
5  female        75
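# Note: on scikit-learn >= 1.2 the argument is sparse_output=False (sparse was removed in 1.4)
encoder = OneHotEncoder(sparse=False)
train_X_encoded = pd.DataFrame(encoder.fit_transform(train_X[['Sex']]))
# Note: on scikit-learn >= 1.0 use encoder.get_feature_names_out(['Sex'])
train_X_encoded.columns = encoder.get_feature_names(['Sex'])
train_X.drop(['Sex'], axis=1, inplace=True)
OH_X_train = pd.concat([train_X, train_X_encoded], axis=1)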
>>>
   AgeGroup  Sex_female  Sex_male
0         0         0.0       1.0
1        15         1.0       0.0
2        30         0.0       1.0
3        45         1.0       0.0
4        60         0.0       1.0
5        75         1.0       0.0
I had the same problem with a custom estimator that extends the BaseEstimator class from sklearn.base.
I added an attribute called self.feature_names in __init__, then, as the last step of the transform method, updated self.feature_names with the columns of the result.
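from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class CustomOneHotEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, **kwargs):
        # Filled in by transform() with the names of the dummy columns
        self.feature_names = []

    def fit(self, X, y=None):
        # Nothing to learn; pd.get_dummies does all the work in transform()
        return self

    def transform(self, X):
        result = pd.get_dummies(X)
        # Remember the column names so they can be read back later
        self.feature_names = result.columns
        return result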
A bit basic, I know, but it does the job I need it to.
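One thing to watch with this approach is that `pd.get_dummies` only creates columns for the categories present in the data it is given, so a later call on new data can return different columns. A possible variant, sketched below, records the columns in `fit` and reindexes to them in `transform`. The class name `FittedDummiesEncoder` and the attribute `feature_names_` are hypothetical names chosen for this illustration, not part of scikit-learn.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FittedDummiesEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical variant: learn the dummy columns in fit(), reuse them in transform()."""

    def fit(self, X, y=None):
        # The trailing underscore follows the scikit-learn convention for fitted attributes.
        self.feature_names_ = pd.get_dummies(X).columns
        return self

    def transform(self, X):
        # Align to the columns seen in fit(); categories missing from X are filled with 0,
        # and categories that were not seen in fit() are dropped.
        return pd.get_dummies(X).reindex(columns=self.feature_names_, fill_value=0)


train = pd.DataFrame({'Sex': ['male', 'female', 'male']})
test = pd.DataFrame({'Sex': ['male', 'male']})  # 'female' never appears here

enc = FittedDummiesEncoder().fit(train)
out = enc.transform(test)
print(list(enc.feature_names_))
print(list(out.columns))
```

Both printed lists contain the same names, so a downstream model always sees the column layout it was trained on.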
If you want the column names alongside the feature importances from your sklearn pipeline, you can get the importances from the classifier step and the column names from the one-hot encoding step.
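# "clf" and "ohc" are the step names used in the pipeline; model is a fitted search object
a = model.best_estimator_.named_steps["clf"].feature_importances_
b = model.best_estimator_.named_steps["ohc"].feature_names
# Use the column names as the index of the importances
df = pd.DataFrame(a, index=b)
df.sort_values(by=[0], ascending=False).head(20)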
You can get the column names using the .get_feature_names() method.
>>> ohenc.get_feature_names()
>>> x_cat_df.columns = ohenc.get_feature_names()
Detailed example is here.
Update: from scikit-learn version 1.0, use get_feature_names_out instead. get_feature_names was deprecated in 1.0 and removed in 1.2.
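If the same code has to run on both older and newer scikit-learn versions, one option is a small wrapper that picks whichever method the fitted encoder provides. The helper name `encoded_column_names` is hypothetical and used only for this sketch.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder


def encoded_column_names(encoder, input_features=None):
    """Hypothetical helper: return output column names from a fitted encoder."""
    if hasattr(encoder, 'get_feature_names_out'):
        # scikit-learn >= 1.0
        return list(encoder.get_feature_names_out(input_features))
    # Older releases only provide get_feature_names
    return list(encoder.get_feature_names(input_features))


ohenc = OneHotEncoder()
ohenc.fit(np.array([['male'], ['female'], ['male']]))
print(encoded_column_names(ohenc, ['Sex']))
```

The result can be assigned straight to the `columns` of the encoded DataFrame, as in the answers above.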