Convert array of string (category) to array of int from a pandas dataframe
If you have a vector of strings or other objects and you want to give it categorical labels, you can use the Factor class (available in the pandas namespace):
In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])
In [2]: s
Out[2]:
0 single
1 touching
2 nuclei
3 dusts
4 touching
5 single
6 nuclei
Name: None, Length: 7
In [4]: Factor(s)
Out[4]:
Factor:
array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
Levels (4): [dusts nuclei single touching]
The factor has attributes labels and levels:
In [7]: f = Factor(s)
In [8]: f.labels
Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)
In [9]: f.levels
Out[9]: Index([dusts, nuclei, single, touching], dtype=object)
This is intended for 1D vectors, so I'm not sure whether it can be applied directly to your problem, but have a look.
BTW, I recommend asking these questions on the statsmodels and/or scikit-learn mailing lists, since most of us are not frequent SO users.
I am answering the question for Pandas 0.10.1. Factor.from_array seems to do the trick.
>>> s = pandas.Series(['a', 'b', 'a', 'c', 'a', 'b', 'a'])
>>> s
0 a
1 b
2 a
3 c
4 a
5 b
6 a
>>> f = pandas.Factor.from_array(s)
>>> f
Categorical:
array([a, b, a, c, a, b, a], dtype=object)
Levels (3): Index([a, b, c], dtype=object)
>>> f.labels
array([0, 1, 0, 2, 0, 1, 0])
>>> f.levels
Index([a, b, c], dtype=object)
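Note that the repr above already says Categorical rather than Factor. As far as I know, later pandas releases dropped the Factor name, and the labels and levels attributes became codes and categories. A minimal sketch of the same idea, assuming a reasonably recent pandas (the variable names are only for illustration):

```python
import pandas as pd

s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
               'touching', 'single', 'nuclei'])

# Categorical collects the distinct values and sorts them to build its categories.
cat = pd.Categorical(s)

codes = cat.codes            # one integer per element (what Factor called "labels")
categories = cat.categories  # the distinct values (what Factor called "levels")

print(codes.tolist())        # [2, 3, 1, 0, 3, 2, 1]
print(list(categories))      # ['dusts', 'nuclei', 'single', 'touching']
```

The codes match the f.labels output from the first answer because the categories are sorted in both cases.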
The previous answers are outdated, so here is a solution for mapping strings to numbers that works with version 0.18.1 of Pandas.
For a Series:
In [1]: import pandas as pd
In [2]: s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
   ...:                'touching', 'single', 'nuclei'])
In [3]: s_enc = pd.factorize(s)
In [4]: s_enc[0]
Out[4]: array([0, 1, 2, 3, 1, 0, 2])
In [5]: s_enc[1]
Out[5]: Index([u'single', u'touching', u'nuclei', u'dusts'], dtype='object')
For a DataFrame:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'labels': ['single', 'touching', 'nuclei',
   ...:                               'dusts', 'touching', 'single', 'nuclei']})
In [3]: catenc = pd.factorize(df['labels'])
In [4]: catenc
Out[4]: (array([0, 1, 2, 3, 1, 0, 2]),
Index([u'single', u'touching', u'nuclei', u'dusts'],
dtype='object'))
In [5]: df['labels_enc'] = catenc[0]
In [6]: df
Out[6]:
     labels  labels_enc
0    single           0
1  touching           1
2    nuclei           2
3     dusts           3
4  touching           1
5    single           0
6    nuclei           2
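By default factorize numbers the values in order of first appearance, so the codes depend on row order. If you want codes that follow the sorted order of the strings, and a way to map the integers back to the original strings, something like the following should work. This is only a sketch: the decoded variable is my own addition, and it assumes the column has no missing values (factorize marks those with -1 by default, which would not decode correctly here).

```python
import pandas as pd

df = pd.DataFrame({'labels': ['single', 'touching', 'nuclei',
                              'dusts', 'touching', 'single', 'nuclei']})

# sort=True orders the unique values first, so the codes no longer
# depend on the order in which the strings appear in the column.
codes, uniques = pd.factorize(df['labels'], sort=True)
df['labels_enc'] = codes

# Map the integer codes back to the original strings.
decoded = uniques.take(codes)

print(df['labels_enc'].tolist())  # [2, 3, 1, 0, 3, 2, 1]
print(list(uniques))              # ['dusts', 'nuclei', 'single', 'touching']
```

Keeping uniques around is what makes the encoding reversible; the integer column alone does not record which string each code stands for.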