label-encoder encoding missing values
Don't use LabelEncoder with missing values. I don't know which version of scikit-learn you're using, but in 0.17.1 your code raises TypeError: unorderable types: str() > float().
As you can see in the source, it calls numpy.unique on the data to encode, which raises TypeError if missing values are found, because a float NaN cannot be ordered against the strings. If you want to encode missing values, first convert them to a string:
a[pd.isnull(a)] = 'NaN'
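For a fuller picture, here is a minimal end-to-end sketch of that replacement. The sample values in `a` are an assumption made up for illustration, and the placeholder 'NaN' simply becomes one more class for the encoder:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample column with one missing entry (values are illustrative).
a = pd.Series(['x', np.nan, 'z'], dtype=object)

# Replace the missing entry with a string so every value is a str.
a[pd.isnull(a)] = 'NaN'

encoder = LabelEncoder()
codes = encoder.fit_transform(a)

# The placeholder is now an ordinary class alongside 'x' and 'z'.
print(list(encoder.classes_))
print(list(codes))
```

Keep in mind that the placeholder gets a label of its own, so downstream code sees it as a regular category rather than as a missing value.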
You can also use a mask to restore the missing values from the original data frame after labelling:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})
     A  B    C
0    x  1  2.0
1  NaN  6  1.0
2    z  9  NaN
original = df        # keep a reference to the unencoded frame
mask = df.isnull()   # True where a value is missing
       A      B      C
0  False  False  False
1   True  False  False
2  False  False   True
df = df.astype(str).apply(LabelEncoder().fit_transform)
df.where(~mask, original)
     A  B    C
0  1.0  0  1.0
1  NaN  1  0.0
2  2.0  2  NaN
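Note that in the output above column A starts at 1: the string 'nan' sorts before 'x', so it takes label 0 before being masked out again. If that matters, one option is to fit the encoder only on the values that are present. This is a variation on the mask idea, not the code above, and `encode_non_null` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_non_null(col):
    """Label-encode the non-missing entries of a column; leave NaN in place."""
    out = pd.Series(np.nan, index=col.index)   # start with all-missing floats
    present = col.notnull()                    # True where a value exists
    # Fit only on the present values, so no label is spent on a placeholder.
    out[present] = LabelEncoder().fit_transform(col[present].astype(str))
    return out


df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})
result = df.apply(encode_non_null)
print(result)
```

Every column comes back as float here, because NaN cannot be stored in an integer column.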