Dummify categorical variables for logistic regression with pandas and scikit (OneHotEncoder)
I will try to answer each of your questions individually.
Answer for Question 1
In your code you have used the fit_transform method on both your train and test data, which is not the correct way of doing it. Generally, fit_transform is applied only on your train data set: it fits the encoder on the train data and returns the transformed train data, and the fitted transformer is then used to just transform your test data set. When you apply fit_transform on your test data, you transform the test data with only the options/levels of the categorical variables available in your test data set, and it is very much possible that your test data does not contain all options/levels of all categorical variables. In that case the dimensions of your train and test data sets will differ, resulting in the error you got.
So the correct way of doing it would be:
X_train_t = ct.fit_transform(X_train)
X_test_t = ct.transform(X_test)
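To see why the dimensions can differ, here is a minimal toy sketch (hypothetical data, not your data frame):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})  # three levels present
test = pd.DataFrame({'Embarked': ['S', 'S']})        # only one level present

enc = OneHotEncoder(sparse=False)
print(enc.fit_transform(train).shape)                         # (3, 3)
print(OneHotEncoder(sparse=False).fit_transform(test).shape)  # (2, 1) -> mismatch
print(enc.transform(test).shape)                              # (2, 3) -> uses train's levels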
Answer for Question 2
If you want to avoid the "dummy variable trap", you can make use of the drop parameter (by setting it to "first") while creating the OneHotEncoder object in the ColumnTransformer. This will result in just one column for Sex and two columns for Embarked, since they have two and three options/levels respectively.
So the correct way of doing it would be:
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), ['Sex','Embarked'])], remainder='passthrough')
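A quick toy check of the resulting columns (hypothetical data; the names returned by get_feature_names depend on the levels seen during fit):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                    'Embarked': ['S', 'C', 'Q']})
enc = OneHotEncoder(sparse=False, drop='first').fit(toy)
print(enc.get_feature_names(['Sex', 'Embarked']))
# ['Sex_male' 'Embarked_Q' 'Embarked_S'] -> 1 + 2 columns, first level dropped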
Answer for Question 3
As of now, the get_feature_names method, which could be used to reconstruct your data frame with the new dummy columns, is not implemented for every configuration in sklearn yet; in particular it fails when the remainder is 'passthrough'. One workaround is to change remainder to 'drop' in the ColumnTransformer construction and build your data frame separately, as shown below:
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), ['Sex', 'Embarked'])], remainder='drop')
X_train_t = ct.fit_transform(X_train)
A = pd.concat([X_train.drop(["Sex", "Embarked"], axis=1), pd.DataFrame(X_train_t, columns=ct.get_feature_names())], axis=1)
A.head()
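With remainder='drop', only the OneHotEncoder is left in the transformer, so ct.get_feature_names() works. Note that it names the columns x0, x1, ... because the encoder only sees column positions inside the ColumnTransformer (output sketched from a typical fit; your levels may differ):
print(ct.get_feature_names())
# e.g. ['onehot__x0_male', 'onehot__x1_Q', 'onehot__x1_S']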
which will give you a data frame combining the untouched columns with the named dummy columns.
Your final code will look like this:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
cols = ['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
train_df = pd.read_csv('train.csv', usecols=cols)
test_df = pd.read_csv('test.csv', usecols=[e for e in cols if e != 'Survived'])
# drop rows with missing values and re-index, so pd.concat below aligns rows correctly
train_df = train_df.dropna().reset_index(drop=True)
test_df = test_df.dropna().reset_index(drop=True)
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.copy()
categorical_values = ['Sex', 'Embarked']
# keep the non-categorical columns aside; the ColumnTransformer drops them
X_train_cont = X_train.drop(categorical_values, axis=1)
X_test_cont = X_test.drop(categorical_values, axis=1)
ct = ColumnTransformer([("onehot", OneHotEncoder(sparse=False, drop="first"), categorical_values)], remainder='drop')
# fit on train only, then reuse the fitted transformer on the test set
X_train_categorical = ct.fit_transform(X_train)
X_test_categorical = ct.transform(X_test)
# glue the dummy columns back onto the continuous columns
X_train_t = pd.concat([X_train_cont, pd.DataFrame(X_train_categorical, columns=ct.get_feature_names())], axis=1)
X_test_t = pd.concat([X_test_cont, pd.DataFrame(X_test_categorical, columns=ct.get_feature_names())], axis=1)
logreg = LogisticRegression(max_iter=5000)
logreg.fit(X_train_t, Y_train)
Y_pred = logreg.predict(X_test_t)
acc_log = round(logreg.score(X_train_t, Y_train) * 100, 2)  # training accuracy
print(acc_log)
80.34
And when you do X_train_t.head() you can see the continuous columns alongside the named dummy columns.
Hope this helps!
Recommended practice is what @Parthasarathy Subburaj's answer suggests, but I have seen in Kaggle and other competitions that people fit on the complete data (train + test). If you want to try the same, use the following format:
ct.fit(X_complete)
X_train_t, X_test_t = ct.transform(X_train), ct.transform(X_test)
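A hedged sketch of building X_complete (hypothetical name; be aware that fitting on train + test leaks the test set's category levels into the preprocessing, which is why @Parthasarathy Subburaj's split is the recommended practice):
import pandas as pd
# stack train and test so the encoder sees every category level
X_complete = pd.concat([X_train, X_test], axis=0, ignore_index=True)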
Ya, use drop='first' to get over this issue. At the same time, remember that this multicollinearity problem is not a big deal for non-linear models such as neural networks or even decision trees; I believe that is the reason why it is not kept as the default argument value (a toy illustration follows below).
get_feature_names is not implemented exhaustively for pipelines and the other pieces of sklearn yet; hence it is not supported completely in ColumnTransformer either.
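To make the multicollinearity point concrete: without drop, the dummy columns of a variable always sum to one, which duplicates the intercept column of a linear model (toy illustration, hypothetical data):
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

emb = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
dummies = OneHotEncoder(sparse=False).fit_transform(emb)
print(np.allclose(dummies.sum(axis=1), 1))  # True: perfectly collinear with an intercept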
Based on my experience, I had built this wrapper for ColumnTransformer, which works even when it contains pipelines or remainder='passthrough'. It also picks up the actual feature names for get_feature_names instead of calling them x0, x1, ..., because we know the actual column names inside the ColumnTransformer through _feature_names_in.
from sklearn.compose import ColumnTransformer
from sklearn.utils.validation import check_is_fitted

def _get_features_out(name, trans, features_in):
    if hasattr(trans, 'get_feature_names'):
        return [name + "__" + f for f in
                trans.get_feature_names(features_in)]
    else:
        return features_in

class NamedColumnTransformer(ColumnTransformer):
    def get_feature_names(self):
        check_is_fitted(self)
        feature_names = []
        for name, trans, features, _ in self._iter(fitted=True):
            if trans == 'drop':
                continue
            if trans == 'passthrough':
                feature_names.extend(self._feature_names_in[features])
            elif hasattr(trans, '_iter'):
                for _, op_name, t in trans._iter():
                    features = _get_features_out(op_name, t, features)
                feature_names.extend(features)
            elif not hasattr(trans, 'get_feature_names'):
                raise AttributeError("Transformer %s (type %s) does not "
                                     "provide get_feature_names."
                                     % (str(name), type(trans).__name__))
            else:
                feature_names.extend(_get_features_out(name, trans, features))
        return feature_names
Now, for your example,
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
# you can fetch the titanic dataset using this
X, y = fetch_openml("titanic", version=1,
as_frame=True, return_X_y=True)
# removing the columns which you are not using
X.drop(['name', 'ticket', 'cabin', 'boat', 'body', 'home.dest'],
axis=1, inplace=True)
X.dropna(inplace=True)
y = y[X.index]  # subset the labels before re-indexing, so they match the kept rows
X.reset_index(drop=True, inplace=True)
y.reset_index(drop=True, inplace=True)
categorical_values = ['sex', 'embarked']
ct = NamedColumnTransformer([("onehot", OneHotEncoder(
sparse=False, drop="first"), categorical_values)], remainder='passthrough')
clf = Pipeline(steps=[('preprocessor', ct),
('classifier', LogisticRegression(max_iter=5000))])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf.fit(X_train, y_train)
clf[0].get_feature_names()
# ['onehot__sex_male',
# 'onehot__embarked_Q',
# 'onehot__embarked_S',
# 'pclass',
# 'age',
# 'sibsp',
# 'parch',
# 'fare']
pd.DataFrame(clf[0].transform(X_train), columns=clf[0].get_feature_names())
You can also try the NamedColumnTransformer on a more interesting ColumnTransformer example here.
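Since the linked example is not reproduced here, a hedged sketch of the pipeline case the wrapper claims to handle (a hypothetical numeric sub-pipeline on the same titanic frame; exact names depend on your sklearn version):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# hypothetical: a numeric sub-pipeline nested inside the NamedColumnTransformer
num_pipe = Pipeline([('impute', SimpleImputer()), ('scale', StandardScaler())])
ct2 = NamedColumnTransformer(
    [('num', num_pipe, ['age', 'fare']),
     ('onehot', OneHotEncoder(sparse=False, drop='first'), ['sex', 'embarked'])],
    remainder='passthrough')
ct2.fit(X)
print(ct2.get_feature_names())
# e.g. ['age', 'fare', 'onehot__sex_male', 'onehot__embarked_Q',
#       'onehot__embarked_S', 'pclass', 'sibsp', 'parch']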