LabelEncoder order of fit for a Pandas df
The labels are encoded in sorted order; for strings, that means alphabetical order. I could not find this documented, but looking at the source code for LabelEncoder.transform, we can see that the work is mostly delegated to the function numpy.setdiff1d, whose documentation reads:
Find the set difference of two arrays.
Return the sorted, unique values in ar1 that are not in ar2.
(Emphasis mine).
Note that since this is not documented, it is probably implementation-defined and could change between versions. It may be that only the version I looked at uses sorted order, and other versions of scikit-learn could behave differently (for example, by not using numpy.setdiff1d).
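A quick way to check the behavior on your own installation is to fit on deliberately unsorted labels and inspect classes_. This is only a small sketch with made-up labels; the outputs in the comments are what I would expect when sorted order applies, so verify them against your scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Made-up labels, deliberately not in alphabetical order
labels = ['second', 'first', 'third', 'first']

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the learned labels; the position of a label is its code
print(le.classes_)  # ['first' 'second' 'third'] when sorted order applies
print(codes)        # [1 0 2 0] under that ordering

# numpy.setdiff1d itself returns sorted, unique values
print(np.setdiff1d(['b', 'c', 'a', 'b'], ['c']))  # ['a' 'b']
```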
I was also a bit surprised that I cannot provide an order to LabelEncoder. A one-line solution could look like this:
df['col1_num'] = df['col1'].apply(lambda x: ['first', 'second', 'third', 'fourth'].index(x))
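The one-liner above rebuilds the list and does a linear search for every row. If that matters, the same custom order can be expressed with a dictionary lookup or with an ordered pandas Categorical. The sketch below assumes a small made-up DataFrame; the names order, mapping, and col1_cat are only illustrative.

```python
import pandas as pd

# The desired encoding order (position in the list = numeric code)
order = ['first', 'second', 'third', 'fourth']

# Made-up sample data for illustration
df = pd.DataFrame({'col1': ['third', 'first', 'fourth', 'second', 'first']})

# Option 1: build the label -> code mapping once, then map the column
mapping = {label: i for i, label in enumerate(order)}
df['col1_num'] = df['col1'].map(mapping)

# Option 2: an ordered Categorical; .codes gives the position in `order`
df['col1_cat'] = pd.Categorical(df['col1'], categories=order, ordered=True).codes

print(df)
```

Note that the two options treat unknown labels differently: map yields NaN for a value missing from mapping, while Categorical assigns the code -1, whereas list.index in the one-liner raises a ValueError.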