label-encoder encoding missing values

Don't use LabelEncoder with missing values. I don't know which version of scikit-learn you're using, but in 0.17.1 your code raises TypeError: unorderable types: str() > float().

As you can see in the source it uses numpy.unique against the data to encode, which raises TypeError if missing values are found. If you want to encode missing values, first change its type to a string:

a[pd.isnull(a)]  = 'NaN'

you can also use a mask to replace form the original data frame after labelling

df = pd.DataFrame({'A': ['x', np.NaN, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.NaN]})

    A   B   C
0   x   1   2.0
1   NaN 6   1.0
2   z   9   NaN

original = df
mask = df_1.isnull()
       A    B   C
0   False   False   False
1   True    False   False
2   False   False   True

df = df.astype(str).apply(LabelEncoder().fit_transform)
df.where(~mask, original)

A   B   C
0   1.0 0   1.0
1   NaN 1   0.0
2   2.0 2   NaN

label-encoder encoding missing values

Tags:

Python

Pandas

Scikit Learn

Related

Recent Posts