sklearn - how to incorporate missing data when one-hot encoding
If you have pandas, this is pretty simple.
import numpy as np
import pandas as pd

s = pd.Series(['A', 'A', 0, 'B', 0, 'A', np.nan])
s
0 A
1 A
2 0
3 B
4 0
5 A
6 NaN
dtype: object
Use replace to convert 0 (and the string '0') to NaN:
s = s.replace({0 : np.nan, '0' : np.nan})
s
0 A
1 A
2 NaN
3 B
4 NaN
5 A
6 NaN
dtype: object
Now call pd.get_dummies, which ignores NaN values by default, so rows with missing data are encoded as all zeros.
pd.get_dummies(s, dtype=int)  # dtype=int gives 0/1; pandas 2.0+ returns booleans by default
   A  B
0  1  0
1  1  0
2  0  0
3  0  1
4  0  0
5  1  0
6  0  0
The same approach works for a DataFrame: call replace on the frame, then pass it to pd.get_dummies.
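As a minimal sketch of the DataFrame case, assuming a hypothetical two-column frame (the column names col1 and col2 and their values are made up for illustration) where 0 and '0' mark missing entries:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; 0 and '0' stand in for missing values.
df = pd.DataFrame({
    "col1": ["A", "A", 0, "B", 0, "A", np.nan],
    "col2": ["x", "0", "y", "x", np.nan, "y", "x"],
})

# Turn the placeholder values into NaN across the whole frame.
df = df.replace({0: np.nan, "0": np.nan})

# Encode every column; a NaN cell yields zeros in all of that
# column's indicator columns. dtype=int requests 0/1 output.
dummies = pd.get_dummies(df, dtype=int)
print(dummies)
```

Each original column becomes a group of prefixed indicator columns (here col1_A, col1_B, col2_x, col2_y). If you would rather keep missingness as its own feature instead of all zeros, passing dummy_na=True to pd.get_dummies adds an extra indicator column for NaN.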