Reconstruct a categorical variable from dummies in pandas
In [46]: s = pd.Series(list('aaabbbccddefgh')).astype('category')
In [47]: s
Out[47]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
In [48]: df = pd.get_dummies(s)
In [49]: df
Out[49]:
a b c d e f g h
0 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0
5 0 1 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0
7 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0
9 0 0 0 1 0 0 0 0
10 0 0 0 0 1 0 0 0
11 0 0 0 0 0 1 0 0
12 0 0 0 0 0 0 1 0
13 0 0 0 0 0 0 0 1
In [50]: x = df.stack()
# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
So I think we need a function to do this, as it seems to be a natural operation. Maybe get_categories(); see here.
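As a rough illustration of what such a function could look like, here is a minimal sketch that wraps the stack approach above. The name categorical_from_dummies is hypothetical (it is not a pandas function), and the sketch assumes every row of the dummy frame contains at most one nonzero entry.

```python
import pandas as pd

def categorical_from_dummies(dummies):
    """Hypothetical helper: invert a dummy frame using the stack approach."""
    stacked = dummies.stack()
    # Keep only the (row, column) pairs whose indicator is nonzero.
    hot = stacked[stacked != 0]
    rows = hot.index.get_level_values(0)
    labels = hot.index.get_level_values(1)
    # Take the categories from the column index of the dummy frame.
    cat = pd.Categorical(labels, categories=list(dummies.columns))
    return pd.Series(cat, index=rows)

s = pd.Series(list('aabcb'))
restored = categorical_from_dummies(pd.get_dummies(s))
```

Note that a row containing only zeros has no nonzero entry to keep, so it is simply absent from the result rather than mapped to some category.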
It's been a few years, so this may well not have been in the pandas toolkit back when this question was originally asked, but this approach seems a little easier to me. idxmax will return the index corresponding to the largest element (i.e. the one with a 1). We do axis=1 because we want the column name where the 1 occurs.
EDIT: I didn't bother making it categorical instead of just a string, but you can do that the same way as @Jeff did by wrapping it with pd.Categorical (and pd.Series, if desired).
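For anyone who does want the categorical version, a minimal sketch of that wrapping might look like the following. Passing the dummy columns as the categories is my own choice here, not something the original answer specifies.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])
dummies = pd.get_dummies(s)

# idxmax(axis=1) gives the column label holding the single 1 in each row.
labels = dummies.idxmax(axis=1)

# Wrap the labels in pd.Categorical, using the dummy columns as categories,
# then in pd.Series so the original row index is kept.
s_cat = pd.Series(
    pd.Categorical(labels, categories=list(dummies.columns)),
    index=dummies.index,
)
```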
In [1]: import pandas as pd
In [2]: s = pd.Series(['a', 'b', 'a', 'c'])
In [3]: s
Out[3]:
0 a
1 b
2 a
3 c
dtype: object
In [4]: dummies = pd.get_dummies(s)
In [5]: dummies
Out[5]:
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
In [6]: s2 = dummies.idxmax(axis=1)
In [7]: s2
Out[7]:
0 a
1 b
2 a
3 c
dtype: object
In [8]: (s2 == s).all()
Out[8]: True
EDIT in response to @piRSquared's comment:
This solution does indeed assume there's exactly one 1 per row. I think this is usually the format one has. pd.get_dummies can return rows that are all 0 if you have drop_first=True or if there are NaN values and dummy_na=False (default) (any cases I'm missing?). A row of all zeros will be treated as if it were an instance of the variable named in the first column (e.g. a in the example above).
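A small sketch of that failure mode, using the same example series with drop_first=True. The dtype=int argument is only there to keep the 0/1 values shown earlier, since recent pandas versions return booleans by default.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])

# drop_first=True removes the 'a' column, leaving only 'b' and 'c'.
dummies = pd.get_dummies(s, drop_first=True, dtype=int)

# Rows that were originally 'a' now contain only zeros.
all_zero = dummies.sum(axis=1) == 0

# idxmax breaks the tie by picking the first column, so those rows
# come back labelled 'b' instead of 'a'.
recovered = dummies.idxmax(axis=1)
```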
If drop_first=True, you have no way to know from the dummies dataframe alone what the name of the "first" variable was, so that operation isn't invertible unless you keep extra information around; I'd recommend leaving drop_first=False (default).
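If you really do need drop_first=True, one way to "keep extra information around" is to remember the dropped label yourself. This is only a sketch: the first_label variable is my own, and it assumes the series has no NaNs and that the dropped category is the smallest label in sort order.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])

# Extra information stored alongside the dummies: the label that
# drop_first=True removes (assumed here to be the first in sort order).
first_label = sorted(s.unique())[0]

dummies = pd.get_dummies(s, drop_first=True, dtype=int)

recovered = dummies.idxmax(axis=1)
# Rows with no 1 at all must have been the dropped category.
recovered = recovered.mask(dummies.sum(axis=1) == 0, first_label)
```

Newer pandas releases (1.5 and later) also provide pd.from_dummies, which accepts a default_category argument for this same situation.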
Since dummy_na=False is the default, this could certainly cause problems. Please set dummy_na=True when you call pd.get_dummies if you want to use this solution to invert the "dummification" and your data contains any NaNs. Setting dummy_na=True will always add a "nan" column, even if that column is all 0s, so you probably don't want to set this unless you actually have NaNs. A nice approach might be to set dummies = pd.get_dummies(series, dummy_na=series.isnull().any()). What's also nice is that the idxmax solution will correctly regenerate your NaNs (not just a string that says "nan").
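Here is a short round trip with a NaN in the data, as a sketch of the dummy_na=series.isnull().any() suggestion. The example series is made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a', np.nan, 'b', 'a'])

# Only add the NaN column when the data actually contains NaNs.
dummies = pd.get_dummies(s, dummy_na=s.isnull().any(), dtype=int)

# The NaN column's label is a real NaN, so idxmax returns a real
# missing value for that row rather than the string "nan".
recovered = dummies.idxmax(axis=1)
```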
It's also worth mentioning that setting drop_first=True and dummy_na=False means that NaNs become indistinguishable from an instance of the first variable, so this should be strongly discouraged if your dataset may contain any NaN values.
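If you're not sure how the dummies were produced, a defensive option is to check the one-1-per-row assumption before inverting. The helper below is hypothetical (the name invert_dummies_checked is mine) and is only a sketch of that idea.

```python
import pandas as pd

def invert_dummies_checked(dummies):
    """Hypothetical helper: invert only if each row has exactly one 1."""
    ones_per_row = dummies.astype(int).sum(axis=1)
    if not (ones_per_row == 1).all():
        bad = ones_per_row.index[ones_per_row != 1].tolist()
        raise ValueError(f"rows without exactly one indicator: {bad}")
    return dummies.idxmax(axis=1)

s = pd.Series(['a', None, 'b'])

# With dummy_na=True every row has exactly one 1, so this succeeds.
ok = invert_dummies_checked(pd.get_dummies(s, dummy_na=True))

# With drop_first=True (and the default dummy_na=False) both the 'a' row
# and the missing row are all zeros, so the check refuses to invert.
try:
    invert_dummies_checked(pd.get_dummies(s, drop_first=True))
    raised = False
except ValueError:
    raised = True
```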