Feature Hashing on multiple categorical features (columns)
Hashing (Update)
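If new categories might show up in some of the features, hashing is the way to go. Two notes:
- Collisions are possible: distinct categories can land in the same output column, so choose the number of hashed features (n_features) with the expected number of categories in mind.
- In your case, you want to hash each feature separately, so that every original column gets its own block of hashed columns.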
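To make the collision note concrete, here is a minimal sketch of the hashing trick in plain Python. It uses MD5 from the standard library as a stand-in hash (an assumption for illustration only; FeatureHasher itself uses a signed 32-bit MurmurHash3), and `hash_bucket` and `count_collisions` are hypothetical helper names, not part of any library.

```python
import hashlib


def hash_bucket(value: str, n_features: int) -> int:
    """Map a category string to a column index (stand-in hash, not MurmurHash3)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features


def count_collisions(categories, n_features: int) -> int:
    """Count the categories that share a column with some other category."""
    occupied = {hash_bucket(c, n_features) for c in categories}
    return len(set(categories)) - len(occupied)


# 50 made-up category names, hashed into vectors of increasing size.
categories = [f"category_{i}" for i in range(50)]
for n_features in (8, 64, 1024):
    print(n_features, count_collisions(categories, n_features))
```

With only 8 columns, at least 42 of the 50 categories must collide (pigeonhole). Larger vectors tend to reduce collisions, at the cost of a wider feature matrix.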
One Hot Vector
If the number of categories for each feature is fixed and not too large, use one-hot encoding.
I would recommend using either of the two:
- sklearn.preprocessing.OneHotEncoder
- pandas.get_dummies
Example
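import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import FeatureHasher
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

df = pd.DataFrame({'feature_1': ['A', 'G', 'T', 'A'],
                   'feature_2': ['cat', 'dog', 'elephant', 'zebra']})

# Approach 0 (Hashing per feature)
n_orig_features = df.shape[1]
hash_vector_size = 6

def as_tokens(column):
    # FeatureHasher expects an iterable of token lists. A bare string is hashed
    # character by character (the source of the 2s in the res_0 printout below),
    # and recent scikit-learn releases reject it, so wrap each value in a list.
    return [[value] for value in column]

ct = ColumnTransformer(
    [(f't_{i}',
      make_pipeline(FunctionTransformer(as_tokens),
                    FeatureHasher(n_features=hash_vector_size, input_type='string')),
      i) for i in range(n_orig_features)],
    sparse_threshold=0)  # always return a dense array
res_0 = ct.fit_transform(df)  # res_0.shape[1] == n_orig_features * hash_vector_size

# Approach 1 (OHV); dtype=int gives 0/1 instead of the booleans pandas >= 2.0 returns
res_1 = pd.get_dummies(df, dtype=int)

# Approach 2 (OHV); `sparse_output` replaced `sparse` in scikit-learn 1.2
res_2 = OneHotEncoder(sparse_output=False).fit_transform(df)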
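res_0 (as originally posted; this run hashed each string character by character, hence the entries of magnitude 2, whereas hashing whole values gives exactly one ±1 per 6-column block):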
array([[ 0., 0., 0., 0., 1., 0., 0., 0., 1., -1., 0., -1.],
[ 0., 0., 0., 1., 0., 0., 0., 2., -1., 0., 0., 0.],
[ 0., -1., 0., 0., 0., 0., -2., 2., 2., -1., 0., -1.],
[ 0., 0., 0., 0., 1., 0., 0., 2., 1., -1., 0., -1.]])
res_1:
   feature_1_A  feature_1_G  feature_1_T  feature_2_cat  feature_2_dog  feature_2_elephant  feature_2_zebra
0            1            0            0              1              0                   0                0
1            0            1            0              0              1                   0                0
2            0            0            1              0              0                   1                0
3            1            0            0              0              0                   0                1
res_2:
array([[1., 0., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0., 1., 0.],
[1., 0., 0., 0., 0., 0., 1.]])
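The reason hashing copes with new categories is that FeatureHasher is stateless: nothing is learned from the training data, so a value never seen before still maps to a column. Below is a rough sketch of that property, assuming a hypothetical helper `hash_columns` and a made-up new row; it wraps each value in a one-element list so that the whole string is hashed as a single token.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import FeatureHasher


def hash_columns(frame, n_features):
    """Hash every column separately and stack the blocks side by side."""
    blocks = []
    for name in frame.columns:
        hasher = FeatureHasher(n_features=n_features, input_type='string')
        # One token per row: the full category string, not its characters.
        tokens = [[str(value)] for value in frame[name]]
        blocks.append(hasher.transform(tokens).toarray())
    return np.hstack(blocks)


train = pd.DataFrame({'feature_1': ['A', 'G', 'T', 'A'],
                      'feature_2': ['cat', 'dog', 'elephant', 'zebra']})
# Hypothetical new row with categories absent from the training frame.
new = pd.DataFrame({'feature_1': ['C'], 'feature_2': ['giraffe']})

hashed_train = hash_columns(train, n_features=6)
hashed_new = hash_columns(new, n_features=6)
print(hashed_train.shape, hashed_new.shape)
```

Both results have the same width (number of columns times n_features), so a model trained on `hashed_train` can consume `hashed_new` directly. The trade-off is that the hashed columns are not interpretable and may contain collisions.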