Transforming multilabels to single label problem
You could try this, to get the desired output based on your original approach:
Option 1
temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1)
df=df.explode('y').fillna(0).reset_index(drop=True)
m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1).astype(int)
df.loc[1:, 'a':'d']=m.astype(int)
Another approach, similar to @ALollz's solution:
Option 2
df=df.assign(y=[np.array(range(i))+1 for i in df.loc[:, 'a':'d'].sum(axis=1)]).explode('y').fillna(1)
m = df.loc[:, 'a':'d'].groupby(level=0).cumsum(1).eq(df.y, axis=0)
df.loc[:, 'a':'d'] = df.loc[:, 'a':'d'].where(m).fillna(0).astype(int)
df['y']=df.loc[:, 'a':'d'].dot(df.columns[list(df.columns).index('a'):list(df.columns).index('d')+1]).replace('',0)
Output:
df
x1 x2 a b c d y
0 1 2 0 0 0 0 0
1 2 -7 1 0 0 0 a
1 2 -7 0 1 0 0 b
1 2 -7 0 0 1 0 c
2 3 4 0 1 0 0 b
2 3 4 0 0 1 0 c
2 3 4 0 0 0 1 d
3 4 3 1 0 0 0 a
3 4 3 0 0 1 0 c
4 5 2 1 0 0 0 a
4 5 2 0 0 1 0 c
4 5 2 0 0 0 1 d
Explanation of Option 1:
First, we use your approach, but instead of change the original data, use copy temp
, and also instead of joining the columns into a string, keep them as a list:
temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1) #without join
df['y']
0 []
1 [a, b, c]
2 [b, c, d]
3 [a, c]
4 [a, c, d]
Then we can use pd.DataFrame.explode
to get the lists expanded, pd.DataFrame.fillna(0)
to fill the first row, and pd.DataFrame.reset_index()
:
df=df.explode('y').fillna(0).reset_index(drop=True)
df
x1 x2 a b c d y
0 1 2 0 0 0 0 0
1 2 -7 1 1 1 0 a
2 2 -7 1 1 1 0 b
3 2 -7 1 1 1 0 c
4 3 4 0 1 1 1 b
5 3 4 0 1 1 1 c
6 3 4 0 1 1 1 d
7 4 3 1 0 1 0 a
8 4 3 1 0 1 0 c
9 5 2 1 0 1 1 a
10 5 2 1 0 1 1 c
11 5 2 1 0 1 1 d
Then we mask df.loc[1:, 'a':'d']
to see when it is equal to y
column, and then, we cast the mask to int, using astype(int)
:
m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.label_concat.values[int(x.name)] ,axis=1)
m
a b c d
1 True False False False
2 False True False False
3 False False True False
4 False True False False
5 False False True False
6 False False False True
7 True False False False
8 False False True False
9 True False False False
10 False False True False
11 False False False True
df.loc[1:, 'a':'d']=m.astype(int)
df.loc[1:, 'a':'d']
a b c d
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 1 0 0
5 0 0 1 0
6 0 0 0 1
7 1 0 0 0
8 0 0 1 0
9 1 0 0 0
10 0 0 1 0
11 0 0 0 1
Important: Note that in the last step we are excluding first row in this case, because it will be True all value in row in the mask, since all values are 0, for a general way you could try this:
#Replace NaN values (the empty list from original df) with ''
df=df.explode('y').fillna('').reset_index(drop=True)
#make the mask with all the rows
msk=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.label_concat.values[int(x.name)] ,axis=1)
df.loc[:, 'a':'d']=msk.astype(int)
#Then, replace the original '' (NaN values) with 0
df=df.replace('',0)