How to use OneHotEncoder for multiple columns and automatically drop first dummy variable for each column?
import pandas as pd
df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org': ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']
                   })
df_2 = pd.get_dummies(df, drop_first=True)
Test:
print(df_2)
   Dept_HR  Org_ABC2  Org_NSV2  name_Joyce  name_Manie
0        0         1         0           0           1
1        1         0         0           1           0
2        1         0         1           0           0
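On recent pandas versions the same call may print True/False instead of 0/1, and the column order can differ from the output above. A minimal sketch, assuming a pandas release whose get_dummies accepts the dtype argument, that requests integer columns and sorts the column names so the layout matches the printed table:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org': ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']})

# dtype=int asks for 0/1 columns explicitly instead of relying on the default
df_2 = pd.get_dummies(df, drop_first=True, dtype=int)

# Sort the column names so the result does not depend on the original column order
df_2 = df_2[sorted(df_2.columns)]
print(df_2)
```

The sorting step is only there to make the comparison with the table above easy; it is not needed for the encoding itself.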
UPDATE regarding your error with pd.get_dummies(X, columns=[1:]):
Per the documentation page, the columns parameter takes "Column Names", not positions, and [1:] on its own is not valid Python. So the following code would work:
df_2 = pd.get_dummies(df, columns=['Org', 'Dept'], drop_first=True)
Output:
    name  Org_ABC2  Org_NSV2  Dept_HR
0  Manie         1         0        0
1  Joyce         0         0        1
2    Ami         0         1        1
If you really want to define your columns positionally, you could do it this way:
column_names_for_onehot = df.columns[1:]
df_2 = pd.get_dummies(df, columns=column_names_for_onehot, drop_first=True)
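As a quick sanity check, the positional selection can be compared with the explicit column names. This is only a sketch; it assumes the DataFrame columns are in the order name, Org, Dept, as in the example above:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org': ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']})

# Everything except the first column, selected by position
column_names_for_onehot = df.columns[1:]
print(list(column_names_for_onehot))

df_positional = pd.get_dummies(df, columns=column_names_for_onehot, drop_first=True)
df_named = pd.get_dummies(df, columns=['Org', 'Dept'], drop_first=True)

# Both calls should produce the same encoded frame
print(df_positional.equals(df_named))
```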
I use my own template for doing that:
from sklearn.base import TransformerMixin
import pandas as pd
import numpy as np

class DataFrameEncoder(TransformerMixin):
    """Dummy-encode every object-dtype column of a DataFrame.

    fit() records the names of the object-dtype columns. transform()
    replaces those columns with their dummy variables (dropping the
    first level of each) and concatenates the result with the
    remaining columns.
    """

    def fit(self, X, y=None):
        # Remember which columns hold object (string) data.
        self.object_col = [col for col in X.columns
                           if X[col].dtype == np.dtype('O')]
        return self

    def transform(self, X, y=None):
        dummy_df = pd.get_dummies(X[self.object_col], drop_first=True)
        # Drop the original object columns by name, then prepend the dummies.
        X = X.drop(columns=self.object_col)
        return pd.concat([dummy_df, X], axis=1)
To use this code, save the template in the current directory under a filename such as customEncoder.py, and then write in your code:
from customEncoder import DataFrameEncoder
data = DataFrameEncoder().fit_transform(data)
All object-type columns are then removed, encoded with the first level of each dropped, and joined back with the remaining columns to give the desired output.
PS: The input to this template must be a pandas DataFrame.
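For illustration, here is a standalone sketch of how the template behaves on a small frame. The class is a condensed copy of the template above, the 'Age' column is a hypothetical numeric column added to show that non-object columns pass through unchanged, and the astype(object) call is an assumption made so that the dtype check matches however the installed pandas infers string columns:

```python
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin


class DataFrameEncoder(TransformerMixin):
    """Condensed copy of the template above so this snippet runs standalone."""

    def fit(self, X, y=None):
        # Record the names of the object-dtype columns
        self.object_col = [col for col in X.columns
                           if X[col].dtype == np.dtype('O')]
        return self

    def transform(self, X, y=None):
        dummy_df = pd.get_dummies(X[self.object_col], drop_first=True)
        return pd.concat([dummy_df, X.drop(columns=self.object_col)], axis=1)


# 'Age' is a hypothetical numeric column added for illustration
data = pd.DataFrame({'Org': ['ABC2', 'ABC1', 'NSV2'],
                     'Dept': ['Finance', 'HR', 'HR'],
                     'Age': [30, 41, 25]})
# Force object dtype so the check in fit() picks up the string columns
data = data.astype({'Org': object, 'Dept': object})

encoded = DataFrameEncoder().fit_transform(data)
print(list(encoded.columns))
```

The dummy columns come first and the untouched columns follow, because transform() concatenates in that order.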
This is quite simple in scikit-learn 0.21 and later: use the drop parameter of OneHotEncoder to drop one of the categories per feature. By default, nothing is dropped. Details can be found in the documentation.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
# drops the first category in each feature
ohe = OneHotEncoder(drop='first', handle_unknown='error')
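To show the encoder applied to several columns at once, here is a rough sketch using the sample data from the question. It builds the output column names by hand from the fitted categories_ attribute, a choice made here so that the snippet does not depend on which feature-name helper a given scikit-learn version provides:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'name': ['Manie', 'Joyce', 'Ami'],
                   'Org': ['ABC2', 'ABC1', 'NSV2'],
                   'Dept': ['Finance', 'HR', 'HR']})

cols = ['Org', 'Dept']
ohe = OneHotEncoder(drop='first', handle_unknown='error')
# fit_transform returns a sparse matrix by default; densify it for display
encoded = ohe.fit_transform(df[cols]).toarray()

# Name the new columns from categories_, skipping the first (dropped)
# category of each feature
new_names = [f"{col}_{cat}"
             for col, cats in zip(cols, ohe.categories_)
             for cat in cats[1:]]

encoded_df = pd.DataFrame(encoded, columns=new_names, index=df.index).astype(int)
df_ohe = pd.concat([df.drop(columns=cols), encoded_df], axis=1)
print(df_ohe)
```

Unlike pd.get_dummies, the fitted encoder can be reused on new data with ohe.transform, which is the usual reason to prefer it inside a scikit-learn workflow.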