One hot encoding train with values not present on test
pd.get_dummies should name the new columns in a way that lets you tell which ones go with each categorical feature. If you want to give it a custom set of prefixes to use, you can use the prefix argument. Then you can look at the list of columns to see all the columns corresponding to each feature. (You don't need prefix_sep='_'; that is the default.)
df = pd.get_dummies(df, prefix=['first_feature', 'second_feature', 'third_feature'])
first_feature_column_names = [c for c in df.columns if c.startswith('first_feature_')]
You can also perform the one-hot encoding one categorical feature at a time, if that helps you keep track of which columns belong to each feature.
df = pd.get_dummies(df, columns=['first_feature'])
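For example, one way to record which dummy columns each pass created is to diff the column sets before and after each call (a sketch with made-up feature names and values):

```python
import pandas as pd

df = pd.DataFrame({
    "first_feature": ["a", "b", "a"],
    "second_feature": ["x", "y", "y"],
})

# Encode one feature at a time, recording which new columns each one produced
feature_columns = {}
for feature in ["first_feature", "second_feature"]:
    before = set(df.columns)
    df = pd.get_dummies(df, columns=[feature])
    feature_columns[feature] = sorted(set(df.columns) - before)

print(feature_columns)
# {'first_feature': ['first_feature_a', 'first_feature_b'],
#  'second_feature': ['second_feature_x', 'second_feature_y']}
```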
As for your issue with some labels only being present in your test set or your training set: if df contains your training and test sets together (and you intend to separate them later with something like sklearn.model_selection.train_test_split), then any category that exists only in your test set will produce an all-zeroes column in your training set. Obviously this won't actually provide any value to your model, but it will keep your column indexes consistent. However, there's really no point in having one-hot columns where none of your training data has a non-zero value in that feature - they will have no effect on your model. You can avoid errors and inconsistent column indexes between training and test by using sklearn.preprocessing.OneHotEncoder.
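Here is a small sketch (with made-up data) of that all-zeroes behaviour when you encode the combined frame and split it afterwards:

```python
import pandas as pd

# Combined frame: the category "c" appears only in the rows
# that will later become the test set
df = pd.DataFrame({"first_feature": ["a", "b", "a", "c"]})
df = pd.get_dummies(df, columns=["first_feature"])

train, test = df.iloc[:3], df.iloc[3:]

# Train and test share identical columns, so indexes stay consistent,
# but first_feature_c is all zeroes in the training rows
print(list(train.columns) == list(test.columns))  # True
print(train["first_feature_c"].sum())             # 0
```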
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([
("onehot", OneHotEncoder(handle_unknown='ignore'), ['first_feature', 'second_feature', 'third_feature']),
], remainder='passthrough')
df_train = ct.fit_transform(df_train)
df_test = ct.transform(df_test)
# Or simply
df = ct.fit_transform(df)
handle_unknown='ignore' tells it to ignore (rather than raise an error for) any value that was not present in the initial training set.
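As a quick sketch of the difference (with made-up category names): the default handle_unknown='error' raises on unseen values, while 'ignore' encodes them as all zeroes:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x_train = np.array([["A1"], ["A2"]])
x_test = np.array([["A3"]])  # category never seen during fit

# Default handle_unknown='error' raises on unseen categories
strict = OneHotEncoder().fit(x_train)
try:
    strict.transform(x_test)
except ValueError as e:
    print("raised:", e)

# handle_unknown='ignore' encodes unseen categories as all zeroes
lenient = OneHotEncoder(handle_unknown="ignore").fit(x_train)
print(lenient.transform(x_test).toarray())  # [[0. 0.]]
```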
Instead of using pd.get_dummies, which has the drawbacks you identified, use sklearn.preprocessing.OneHotEncoder. It automatically fetches all nominal categories from your train data and then encodes your test data according to the categories identified in the training step. If there are new categories in the test data, it will just encode them as 0's.
Example:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
x_train = np.array([["A1","B1","C1"],["A2","B1","C2"]])
x_test = np.array([["A1","B2","C2"]]) # As you can see, "B2" is a new attribute for column B
ohe = OneHotEncoder(handle_unknown='ignore')  # 'ignore' tells the encoder to encode new categories with 0's
ohe.fit(x_train)
print(ohe.transform(x_train).toarray())
>>> [[1. 0. 1. 1. 0.]
 [0. 1. 1. 0. 1.]]
To get a summary of the categories by column in the train set, do:
print(ohe.categories_)
>>> [array(['A1', 'A2'], dtype='<U2'),
array(['B1'], dtype='<U2'),
array(['C1', 'C2'], dtype='<U2')]
To map the one-hot encoded columns back to categories, do:
print(ohe.get_feature_names())
>>> ['x0_A1' 'x0_A2' 'x1_B1' 'x2_C1' 'x2_C2']
(In scikit-learn 1.0 and later, use get_feature_names_out() instead; get_feature_names() was removed in version 1.2.)
Finally, this is how the encoder works on new test data:
print(ohe.transform(x_test).toarray())
>>> [[1. 0. 0. 0. 1.]] # 1 for A1, 0 for A2, 0 for B1, 0 for C1, 1 for C2
EDIT:
You seem to be worried that you lose the labels after doing the encoding. It is actually very easy to get them back: just wrap the result in a DataFrame and take the column names from ohe.get_feature_names():
pd.DataFrame(ohe.transform(x_test).toarray(), columns=ohe.get_feature_names())