Transforming multilabels to single label problem

You could try this to get the desired output while staying close to your original approach:

Option 1

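import pandas as pd

# Work on a copy: replace each 1 with the name of its column
temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
# Collect the column names present in each row as a list
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1)
# One row per label; the empty list of the first row becomes NaN, then 0
df=df.explode('y').fillna(0).reset_index(drop=True)
# True where the column name equals the row's y (first row excluded, see the explanation)
m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1).astype(int)
df.loc[1:, 'a':'d']=m.astype(int)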

Another approach, similar to @ALollz's solution:

Option 2

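import numpy as np
# One counter value (1..k) per active label in each row, then one output row per value
df=df.assign(y=[np.array(range(i))+1 for i in df.loc[:, 'a':'d'].sum(axis=1)]).explode('y').fillna(1)
# Row-wise running count of labels: the n-th label is where the count first reaches y
m = df.loc[:, 'a':'d'].cumsum(axis=1).eq(df.y, axis=0)
df.loc[:, 'a':'d'] = df.loc[:, 'a':'d'].where(m).fillna(0).astype(int)
# Rebuild y as the name of the single remaining label (0 when the row has none)
df['y']=df.loc[:, 'a':'d'].dot(df.loc[:, 'a':'d'].columns).replace('',0)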
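
As I read it, Option 2 relies on a row-wise cumulative sum: the n-th copy of a row keeps only the n-th active label. The sketch below shows this on a single hypothetical source row [1, 1, 1, 0] that has already been repeated three times; the names labels, y, running and one_hot are illustrative only.

```python
import pandas as pd

# Three copies of one source row with labels a, b and c set (hypothetical data).
labels = pd.DataFrame([[1, 1, 1, 0]] * 3, columns=list("abcd"))
# Counter assigned to each copy: 1st, 2nd, 3rd active label.
y = pd.Series([1, 2, 3])

# Running count of active labels from left to right within each row.
running = labels.cumsum(axis=1)

# True where the running count equals the counter of that copy.
mask = running.eq(y, axis=0)

# Keep the original value where the mask is True, 0 elsewhere.
one_hot = labels.where(mask).fillna(0).astype(int)
print(one_hot)
```

The mask can also be True on trailing columns that are 0 (column d in the last copy), which is harmless because where keeps the original 0 there.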

Output (Option 2 keeps the repeated original index shown here; Option 1 resets it to 0-11):

df
   x1  x2  a  b  c  d  y
0   1   2  0  0  0  0  0
1   2  -7  1  0  0  0  a
1   2  -7  0  1  0  0  b
1   2  -7  0  0  1  0  c
2   3   4  0  1  0  0  b
2   3   4  0  0  1  0  c
2   3   4  0  0  0  1  d
3   4   3  1  0  0  0  a
3   4   3  0  0  1  0  c
4   5   2  1  0  0  0  a
4   5   2  0  0  1  0  c
4   5   2  0  0  0  1  d

Explanation of Option 1:

First, we follow your approach, but work on a copy (temp) instead of changing the original data, and keep the matching column names as a list instead of joining them into a string:

temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1)   #without join

df['y']
0           []
1    [a, b, c]
2    [b, c, d]
3       [a, c]
4    [a, c, d]
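
To run this step on its own, here is a minimal sketch. The sample frame is reconstructed from the outputs printed in this answer, so treat it as an assumption about the original data. A plain dict is passed to replace instead of the pd.Series; it should map each column to its own name in the same way.

```python
import pandas as pd

# Sample data reconstructed from the printed outputs in this answer (assumed input).
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, -7, 4, 3, 2],
    "a": [0, 1, 0, 1, 1],
    "b": [0, 1, 1, 0, 0],
    "c": [0, 1, 1, 1, 1],
    "d": [0, 0, 1, 0, 1],
})

labels = df.loc[:, "a":"d"]

# Column-wise replace: a 1 in column "a" becomes "a", a 1 in column "b" becomes "b", ...
temp = labels.replace(1, {col: col for col in labels.columns})

# Keep the non-zero entries of each row as a list (empty when no label is set).
df["y"] = temp.apply(lambda row: [v for v in row if v != 0], axis=1)

print(df["y"].tolist())
```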

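Then we can use pd.DataFrame.explode to expand each list into one row per label, pd.DataFrame.fillna(0) to fill the NaN that explode leaves in y for the first row (its list was empty), and pd.DataFrame.reset_index(drop=True) to renumber the rows: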

df=df.explode('y').fillna(0).reset_index(drop=True)

df
    x1  x2  a  b  c  d            y
0    1   2  0  0  0  0            0
1    2  -7  1  1  1  0            a
2    2  -7  1  1  1  0            b
3    2  -7  1  1  1  0            c
4    3   4  0  1  1  1            b
5    3   4  0  1  1  1            c
6    3   4  0  1  1  1            d
7    4   3  1  0  1  0            a
8    4   3  1  0  1  0            c
9    5   2  1  0  1  1            a
10   5   2  1  0  1  1            c
11   5   2  1  0  1  1            d
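
The behaviour of explode that this step depends on can be checked on a tiny frame. This is a sketch with hypothetical data (the names demo, exploded and tidy are not from the question):

```python
import pandas as pd

# Tiny hypothetical frame: one row with no labels, one row with two.
demo = pd.DataFrame({"x1": [1, 2], "y": [[], ["a", "b"]]})

exploded = demo.explode("y")
# The empty list becomes NaN; the second row is repeated once per label
# and both copies keep the original index label 1.
print(exploded)

# fillna(0) replaces that NaN and reset_index(drop=True) renumbers the rows.
tidy = exploded.fillna(0).reset_index(drop=True)
print(tidy)
```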

Then we build a mask over df.loc[1:, 'a':'d'] that is True where the column name equals the row's value in the y column, and afterwards cast the mask to int using astype(int):

m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1)

m
        a      b      c      d
1    True  False  False  False
2   False   True  False  False
3   False  False   True  False
4   False   True  False  False
5   False  False   True  False
6   False  False  False   True
7    True  False  False  False
8   False  False   True  False
9    True  False  False  False
10  False  False   True  False
11  False  False  False   True



df.loc[1:, 'a':'d']=m.astype(int)

df.loc[1:, 'a':'d']
    a  b  c  d
1   1  0  0  0
2   0  1  0  0
3   0  0  1  0
4   0  1  0  0
5   0  0  1  0
6   0  0  0  1
7   1  0  0  0
8   0  0  1  0
9   1  0  0  0
10  0  0  1  0
11  0  0  0  1

Important: in the last step we exclude the first row. All of its values are 0 and, after fillna(0), so is its y, so every entry of that row in the mask would be True. For a general solution you could try this:

#Replace NaN values (the empty list from original df) with ''
df=df.explode('y').fillna('').reset_index(drop=True)

#make the mask with all the rows
msk=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1)
df.loc[:, 'a':'d']=msk.astype(int)

#Then, replace the original '' (NaN values) with 0
df=df.replace('',0)
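
For comparison, here is a self-contained sketch of the same general idea written as a function. It is a variant, not the code above: instead of a placeholder string it compares y against each column name, which is False for NaN, so rows without labels stay all zero. The function name to_single_label and its parameters are hypothetical, and the sample frame is reconstructed from the outputs printed in this answer.

```python
import pandas as pd


def to_single_label(frame, label_cols, target="y"):
    """Return one row per active label, one-hot encoded, with the label name in `target`."""
    out = frame.copy()
    # List of active label names per row (empty list when no label is set).
    out[target] = [
        [col for col in label_cols if row[col] == 1]
        for _, row in out[label_cols].iterrows()
    ]
    # One output row per label; rows without labels are kept, with NaN in `target`.
    out = out.explode(target).reset_index(drop=True)
    # A comparison with NaN is False, so label-free rows end up all zero.
    for col in label_cols:
        out[col] = (out[target] == col).astype(int)
    # Mark label-free rows with 0, as in the output shown in this answer.
    out[target] = out[target].fillna(0)
    return out


# Sample data reconstructed from the printed outputs in this answer (assumed input).
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, -7, 4, 3, 2],
    "a": [0, 1, 0, 1, 1],
    "b": [0, 1, 1, 0, 0],
    "c": [0, 1, 1, 1, 1],
    "d": [0, 0, 1, 0, 1],
})

result = to_single_label(df, ["a", "b", "c", "d"])
print(result)
```

Passing the label columns explicitly avoids depending on the column order that the 'a':'d' slice assumes.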
