pandas: Group by splitting string value in all rows (a column) and aggregation function

The solutions below assume that the column person_name holds lists:

# if the column holds strings such as '[a, b, c]', convert them to lists first
#df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')

print (type(df.loc[0, 'person_name']))
<class 'list'>
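
As a minimal sketch of that conversion step: the original data is not shown here, so the frame below is a hypothetical sample, chosen so that its sums match the output printed in this answer. It assumes the strings look like '[a, b, c]' with no quotes around the names; if they are quoted, ast.literal_eval is probably a better fit than strip and split.

```python
import pandas as pd

# Hypothetical sample data (an assumption, not the asker's frame):
# each person_name value is a string that looks like a list.
df = pd.DataFrame({
    'person_name': ['[alexander, william, smith]',
                    '[smith, robert, gates]',
                    '[bob, alexander]',
                    '[robert, william]',
                    '[alexander, gates]'],
    'salary': [45000, 65000, 56000, 80000, 70000],
})

# Drop the surrounding brackets, then split on ', ' to get real lists.
df['person_name'] = df['person_name'].str.strip('[]').str.split(', ')

print(type(df.loc[0, 'person_name']))
print(df.loc[0, 'person_name'])
```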

The first idea is to use a defaultdict to accumulate the summed values in a loop:

from collections import defaultdict

d = defaultdict(int)
for p, s in zip(df['person_name'], df['salary']):
    for x in p:
        d[x] += int(s)

print (d)
defaultdict(<class 'int'>, {'alexander': 171000, 
                            'william': 125000, 
                            'smith': 110000, 
                            'robert': 145000, 
                            'gates': 135000, 
                            'bob': 56000})

Then build a DataFrame from the dictionary:

df1 = pd.DataFrame({'group':list(d.keys()),
                    'sum_salary':list(d.values())})
print (df1)
       group  sum_salary
0  alexander      171000
1    william      125000
2      smith      110000
3     robert      145000
4      gates      135000
5        bob       56000

Another solution flattens the lists, repeats each salary by the length of its list, and then aggregates with sum:

from itertools import chain

df1 = pd.DataFrame({
    'group' : list(chain.from_iterable(df['person_name'].tolist())), 
    'sum_salary' : df['salary'].values.repeat(df['person_name'].str.len())
})

df2 = df1.groupby('group', as_index=False, sort=False)['sum_salary'].sum()
print (df2)
       group  sum_salary
0  alexander      171000
1    william      125000
2      smith      110000
3     robert      145000
4      gates      135000
5        bob       56000
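
To see why the repeat step lines up with the flattened names, here is a small sketch on a hypothetical three-row frame (the data is an assumption for illustration). Each salary is repeated once per name in its row, so position i of the flattened names pairs with position i of the repeated salaries.

```python
from itertools import chain

import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    'person_name': [['alexander', 'william', 'smith'],
                    ['smith', 'robert', 'gates'],
                    ['bob', 'alexander']],
    'salary': [45000, 65000, 56000],
})

# Number of names in each row: 3, 3, 2
lengths = df['person_name'].str.len()

# Flatten the lists into one long list of names.
names = list(chain.from_iterable(df['person_name']))

# Repeat each salary as many times as its row has names.
salaries = df['salary'].to_numpy().repeat(lengths)

pairs = list(zip(names, salaries.tolist()))
print(pairs)
```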

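Another solution, using np.concatenate to flatten the lists and Series.repeat to repeat the salaries:

import numpy as np

df_new = pd.DataFrame({
    'person_name': np.concatenate(df.person_name.values),
    'salary': df.salary.repeat(df.person_name.str.len())
})
print(df_new.groupby('person_name')['salary'].sum().reset_index())
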
  person_name  salary
0   alexander  171000
1         bob   56000
2       gates  135000
3      robert  145000
4       smith  110000
5     william  125000
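
On pandas 0.25 or later, the same flatten-and-repeat step can probably be left to DataFrame.explode, which gives each list element its own row and copies the other columns along. This is a different technique from the ones above, shown as a sketch on hypothetical sample data chosen to match the sums printed in this answer.

```python
import pandas as pd

# Hypothetical sample data (an assumption, not the asker's frame).
df = pd.DataFrame({
    'person_name': [['alexander', 'william', 'smith'],
                    ['smith', 'robert', 'gates'],
                    ['bob', 'alexander'],
                    ['robert', 'william'],
                    ['alexander', 'gates']],
    'salary': [45000, 65000, 56000, 80000, 70000],
})

# explode: one row per name, with the row's salary repeated alongside it.
result = (df.explode('person_name')
            .groupby('person_name', as_index=False)['salary']
            .sum())
print(result)
```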

This can also be done concisely with get_dummies, though performance will suffer because of all the .str methods:

df.person_name.str.join('*').str.get_dummies('*').multiply(df.salary, 0).sum()

#alexander    171000
#bob           56000
#gates        135000
#robert       145000
#smith        110000
#william      125000
#dtype: int64
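
The one-liner is easier to follow when split into its intermediate steps. The sketch below does that on hypothetical sample data (an assumption chosen to match the sums above); the names dummies, weighted and totals are introduced only for illustration.

```python
import pandas as pd

# Hypothetical sample data (an assumption, not the asker's frame).
df = pd.DataFrame({
    'person_name': [['alexander', 'william', 'smith'],
                    ['smith', 'robert', 'gates'],
                    ['bob', 'alexander'],
                    ['robert', 'william'],
                    ['alexander', 'gates']],
    'salary': [45000, 65000, 56000, 80000, 70000],
})

# Step 1: join each list into one string, then build one indicator
# column per name (1 where the row contains that name).
dummies = df['person_name'].str.join('*').str.get_dummies('*')

# Step 2: multiply each row of indicators by that row's salary.
weighted = dummies.multiply(df['salary'], axis=0)

# Step 3: sum each column to get the total per name.
totals = weighted.sum()
print(totals)
```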
