Reverse a get_dummies encoding in pandas
Pretty one-liner :) (it works when all the dummy columns come from a single feature)
new_df = df.idxmax(axis=1)
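To make the one-liner concrete, here is a minimal sketch on a single encoded feature. The Series `s` and the name `color` are made up for illustration, and `dtype=int` is passed only so the dummies are 0/1 integers regardless of pandas version.

```python
import pandas as pd

# Hypothetical single categorical feature (illustration only).
s = pd.Series(["red", "green", "blue", "green"], name="color")

# One dummy column per distinct value; column labels are the values themselves.
dummies = pd.get_dummies(s, dtype=int)

# For each row, idxmax(axis=1) returns the label of the column holding the 1.
restored = dummies.idxmax(axis=1).rename("color")
print(restored.tolist())
```

Note that this only recovers one feature: if the frame holds dummies for several features, `idxmax` still returns a single label per row.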
There are several great answers to the OP's question. However, get_dummies
is often applied to multiple categorical features at once. Pandas uses a prefix separator, prefix_sep,
to join each original column name to its values in the dummy column names.
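As a quick illustration of how `prefix_sep` shapes the column names, here is a small sketch. The frame, its column names, and the `"__"` separator are assumptions chosen for the example, not part of the question.

```python
import pandas as pd

# Hypothetical frame with two categorical features (illustration only).
df = pd.DataFrame({"size": ["S", "M"], "color": ["red", "blue"]})

# Each dummy column is named "<original column><prefix_sep><value>".
encoded = pd.get_dummies(df, prefix_sep="__", dtype=int)
print(sorted(encoded.columns))

# Splitting on the separator recovers the original column names.
prefixes = {name.split("__")[0] for name in encoded.columns}
print(sorted(prefixes))
```

Choosing a separator that does not occur in the original column names makes the split back into prefix and value unambiguous.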
The following function collapses a "dummified" dataframe while keeping the order of columns:
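import pandas as pd

def undummify(df, prefix_sep="_"):
    # Map each column prefix to whether it looks like a dummy column,
    # i.e. whether the column name contains the separator.
    cols2collapse = {
        item.split(prefix_sep)[0]: (prefix_sep in item) for item in df.columns
    }
    series_list = []
    for col, needs_to_collapse in cols2collapse.items():
        if needs_to_collapse:
            # Select columns starting with "<col><sep>"; a substring match
            # (filter(like=col)) could also pick up unrelated columns.
            undummified = (
                df.loc[:, df.columns.str.startswith(col + prefix_sep)]
                .idxmax(axis=1)
                .apply(lambda x: x.split(prefix_sep, maxsplit=1)[1])
                .rename(col)
            )
            series_list.append(undummified)
        else:
            # Columns without the separator are passed through unchanged.
            series_list.append(df[col])
    undummified_df = pd.concat(series_list, axis=1)
    return undummified_df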
Example
>>> df
     a    b    c
0  A_1  B_1  C_1
1  A_2  B_2  C_2
>>> df2 = pd.get_dummies(df)
>>> df2
   a_A_1  a_A_2  b_B_1  b_B_2  c_C_1  c_C_2
0      1      0      1      0      1      0
1      0      1      0      1      0      1
>>> df3 = undummify(df2)
>>> df3
     a    b    c
0  A_1  B_1  C_1
1  A_2  B_2  C_2