Issue with OneHotEncoder for categorical features
If you read the docs for OneHotEncoder you'll see the input for fit is "Input array of type int", so you need two steps to one-hot encode your categorical data:
from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.LabelEncoder()
enc.fit(cat_features)
new_cat_features = enc.transform(cat_features)
print(new_cat_features)  # [1 2 0]
new_cat_features = new_cat_features.reshape(-1, 1) # Needs to be the correct shape
ohe = preprocessing.OneHotEncoder(sparse=False)  # sparse=False makes the output easier to read
print(ohe.fit_transform(new_cat_features))
Output:
[[ 0. 1. 0.]
[ 0. 0. 1.]
[ 1. 0. 0.]]
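As a side note, the snippet above label-encodes the list of column names itself; on real data you would usually fit one LabelEncoder per column of values. A minimal sketch with made-up data (using .toarray() rather than the sparse flag so it runs on both old and new scikit-learn versions):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up values: one row per sample, one column per categorical feature
data = np.array([['Red', 'Nolan', 'Hardy'],
                 ['Blue', 'Scott', 'Damon'],
                 ['Red', 'Nolan', 'Bale']])

# LabelEncoder only handles 1-D input, so fit one encoder per column
encoded = np.column_stack(
    [LabelEncoder().fit_transform(data[:, i]) for i in range(data.shape[1])]
)

ohe = OneHotEncoder()
print(ohe.fit_transform(encoded).toarray())
```

Each column contributes one one-hot group (2 + 2 + 3 = 7 output columns here), with exactly one 1 per group per row.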
EDIT

As of scikit-learn 0.20 this became a bit easier, not only because OneHotEncoder now handles strings nicely, but also because we can transform multiple columns easily using ColumnTransformer; see below for an example:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import numpy as np
X = np.array([['apple', 'red', 1, 'round', 0],
              ['orange', 'orange', 2, 'round', 0.1],
              ['banana', 'yellow', 2, 'long', 0],
              ['apple', 'green', 1, 'round', 0.2]])
ct = ColumnTransformer(
    [('oh_enc', OneHotEncoder(sparse=False), [0, 1, 3])],  # the column numbers I want to apply this to
    remainder='passthrough'  # This leaves the rest of my columns in place
)
print(ct.fit_transform(X))  # Notice the output is a string
Output:
[['1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1' '0']
['0.0' '0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '0.0' '1.0' '2' '0.1']
['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '2' '0']
['1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1' '0.2']]
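The all-string output comes from the NumPy input itself, which can only hold a single dtype. One way around it, sketched here with invented column names for the same data, is to start from a pandas DataFrame so the passthrough columns keep their numeric dtypes (sparse_threshold=0 forces a dense array):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column names for the data above
df = pd.DataFrame({
    'fruit': ['apple', 'orange', 'banana', 'apple'],
    'color': ['red', 'orange', 'yellow', 'green'],
    'count': [1, 2, 2, 1],
    'shape': ['round', 'round', 'long', 'round'],
    'score': [0.0, 0.1, 0.0, 0.2],
})

ct = ColumnTransformer(
    [('oh_enc', OneHotEncoder(), ['fruit', 'color', 'shape'])],
    remainder='passthrough',
    sparse_threshold=0,  # always return a dense array
)
result = ct.fit_transform(df)
print(result)  # a float array; 'count' and 'score' keep their numeric values
```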
If the dataset is in a pandas DataFrame, using pandas.get_dummies will be more straightforward.
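For instance, a minimal sketch with a made-up frame:

```python
import pandas as pd

# Made-up frame: one categorical column, one numeric column
df = pd.DataFrame({
    'color': ['red', 'blue', 'red'],
    'budget': [100, 200, 150],
})

# get_dummies one-hot encodes the listed columns and leaves the rest alone
dummies = pd.get_dummies(df, columns=['color'])
print(dummies)
```

The `color` column is replaced by `color_blue` and `color_red` indicator columns while `budget` passes through unchanged.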
From the documentation:
categorical_features : “all” or array of indices or mask
Specify what features are treated as categorical.
‘all’ (default): All features are treated as categorical.
array of indices: Array of categorical feature indices.
mask: Array of length n_features and with dtype=bool.
Column names of a pandas DataFrame won't work; if your categorical features are column numbers 0, 2 and 6, use:
from sklearn import preprocessing
cat_features = [0, 2, 6]
enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
enc.fit(dataset.values)
It must also be noted that if these categorical features are not already label encoded, you need to use LabelEncoder on them before using OneHotEncoder. (The categorical_features parameter itself was deprecated in scikit-learn 0.20 in favour of ColumnTransformer.)
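A minimal sketch of that label-encoding step, applied column by column to a hypothetical DataFrame (the data and column positions are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical dataset; in practice this would be your own DataFrame
dataset = pd.DataFrame({
    'genre': ['drama', 'comedy', 'drama'],   # column 0: categorical
    'year': [2001, 1999, 2010],              # column 1: numeric
    'country': ['US', 'UK', 'US'],           # column 2: categorical
})

cat_features = [0, 2]
for i in cat_features:
    col = dataset.columns[i]
    # LabelEncoder maps the sorted unique strings to 0, 1, 2, ...
    dataset[col] = LabelEncoder().fit_transform(dataset[col])
print(dataset)
```

After this loop every categorical column holds small integers, which is the form the old integer-only OneHotEncoder expected.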
You can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:
from sklearn.preprocessing import LabelBinarizer

cat_features = ['color', 'director_name', 'actor_2_name']
encoder = LabelBinarizer()
new_cat_features = encoder.fit_transform(cat_features)
print(new_cat_features)
Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
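For example, reusing the list from the snippet above:

```python
from sklearn.preprocessing import LabelBinarizer

cat_features = ['color', 'director_name', 'actor_2_name']
encoder = LabelBinarizer(sparse_output=True)
sparse_result = encoder.fit_transform(cat_features)
print(type(sparse_result))  # a SciPy sparse matrix rather than a dense ndarray
```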
Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow