numpy convert categorical string arrays to an integer array
... years later....
For completeness (because this isn't mentioned in the other answers) and for personal reasons (I always have pandas imported in my modules, but not necessarily sklearn), this is also quite straightforward with pandas.get_dummies().
import numpy as np
import pandas
In [1]: a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])
In [2]: b = pandas.get_dummies(a)
In [3]: b
Out[3]:
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  0  1  0
5  0  0  1
In [4]: b.values.argmax(1)
Out[4]: array([0, 1, 2, 0, 1, 2])
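If you also need to go back from those integer codes to the string labels, the dummy frame's column index can serve as the lookup table. A minimal sketch (the names codes and labels are mine, purely for illustration):

import numpy as np
import pandas

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])
b = pandas.get_dummies(a)

# position of the 1 in each row = integer code of that row's category
codes = b.values.argmax(1)            # array([0, 1, 2, 0, 1, 2])

# the column index maps the codes back to the original string labels
labels = b.columns[codes]
print(np.asarray(labels))             # ['a' 'b' 'c' 'a' 'b' 'c']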
np.unique has some optional return values.
return_inverse gives the integer encoding, which I use very often:
>>> b, c = np.unique(a, return_inverse=True)
>>> b
array(['a', 'b', 'c'],
      dtype='|S1')
>>> c
array([0, 1, 2, 0, 1, 2])
>>> c + 1      # shift to a 1-based encoding if that's what you need
array([1, 2, 3, 1, 2, 3])
The inverse indices can also be used to recreate the original array from the uniques:
>>> b[c]
array(['a', 'b', 'c', 'a', 'b', 'c'],
      dtype='|S1')
>>> (b[c] == a).all()
True
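And in case it helps, the same return_inverse pair folds naturally into a tiny encode/decode helper. A rough sketch using nothing beyond np.unique and fancy indexing (the function names are my own):

import numpy as np

def encode(values):
    # uniques comes back sorted; inverse holds, for each element,
    # its index into uniques -- this is the integer encoding
    uniques, inverse = np.unique(values, return_inverse=True)
    return uniques, inverse

def decode(uniques, inverse):
    # fancy indexing with the inverse indices rebuilds the original array
    return uniques[inverse]

a = np.array(['a', 'b', 'c', 'a', 'b', 'c'])
labels, codes = encode(a)
print(codes)                          # [0 1 2 0 1 2]
assert (decode(labels, codes) == a).all()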