Reversing 'one-hot' encoding in Pandas
I'd do:
cols = df.columns.to_series().values
pd.DataFrame(np.repeat(cols[None, :], len(df), 0)[df.astype(bool).values], df.index[df.any(1)])
Timing
MaxU's method has edge for large dataframes
Small df
5 x 3
Large df
1000000 x 52
I would use apply to decode the columns:
In [2]: animals = pd.DataFrame({"monkey":[0,1,0,0,0],"rabbit":[1,0,0,0,0],"fox":[0,0,1,0,0]})
In [3]: def get_animal(row):
...: for c in animals.columns:
...: if row[c]==1:
...: return c
In [4]: animals.apply(get_animal, axis=1)
Out[4]:
0 rabbit
1 monkey
2 fox
3 None
4 None
dtype: object
This works with both single and multiple labels.
We can use advanced indexing to tackle this problem. Here is the link.
import pandas as pd
df = pd.DataFrame({"monkey":[1,1,0,1,0],"rabbit":[1,1,1,1,0],\
"fox":[1,0,1,0,0], "cat":[0,0,0,0,1]})
df['tags']='' # to create an empty column
for col_name in df.columns:
df.ix[df[col_name]==1,'tags']= df['tags']+' '+col_name
print df
And the result is:
cat fox monkey rabbit tags
0 0 1 1 1 fox monkey rabbit
1 0 0 1 1 monkey rabbit
2 0 1 0 1 fox rabbit
3 0 0 1 1 monkey rabbit
4 1 0 0 0 cat
Explanation: We iterate over the columns on the dataframe.
df.ix[selection criteria, columns to write value] = value
df.ix[df[col_name]==1,'tags']= df['tags']+' '+col_name
The above line basically finds you all the places where df[col_name] == 1, selects column 'tags' and set it to the RHS value which is df['tags']+' '+ col_name
Note: .ix
has been deprecated since Pandas v0.20. You should instead use .loc
or .iloc
, as appropriate.
UPDATE: i think ayhan is right and it should be:
df.idxmax(axis=1)
Demo:
In [40]: s = pd.Series(['dog', 'cat', 'dog', 'bird', 'fox', 'dog'])
In [41]: s
Out[41]:
0 dog
1 cat
2 dog
3 bird
4 fox
5 dog
dtype: object
In [42]: pd.get_dummies(s)
Out[42]:
bird cat dog fox
0 0.0 0.0 1.0 0.0
1 0.0 1.0 0.0 0.0
2 0.0 0.0 1.0 0.0
3 1.0 0.0 0.0 0.0
4 0.0 0.0 0.0 1.0
5 0.0 0.0 1.0 0.0
In [43]: pd.get_dummies(s).idxmax(1)
Out[43]:
0 dog
1 cat
2 dog
3 bird
4 fox
5 dog
dtype: object
OLD answer: (most probably, incorrect answer)
try this:
In [504]: df.idxmax().reset_index().rename(columns={'index':'animal', 0:'idx'})
Out[504]:
animal idx
0 fox 2
1 monkey 1
2 rabbit 0
data:
In [505]: df
Out[505]:
fox monkey rabbit
0 0 0 1
1 0 1 0
2 1 0 0
3 0 0 0
4 0 0 0