Calculate percentage of similar values in pandas dataframe
You could divide each row by its row sum (the sum along axis 1), then cast to string and append '%':
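# one indicator column per keyword in L, summed per speaker
out = (df.set_index('Speaker')['Script'].str.findall('|'.join(L))
         .str.join('|')
         .str.get_dummies()
         .groupby(level=0).sum())  # .sum(level=0) was removed in pandas 2.0
# divide each row by its row total, then format as a percentage string
out.div(out.sum(axis=1), axis=0).mul(100).astype(int).astype(str).add('%')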
a b c
Speaker
Speaker 1 50% 25% 25%
Speaker 2 100% 0% 0%
Speaker 3 0% 100% 0%
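For anyone who wants to run this end to end, here is a minimal sketch. The thread does not show the original df or L, so the sample data below is hypothetical and chosen only so the result matches the table above. It uses `groupby(level=0).sum()` and `DataFrame.div(..., axis=0)`, which should work on both older and current pandas versions.

```python
import pandas as pd

# Hypothetical sample data: the real df and L are not shown in the thread.
df = pd.DataFrame({
    "Speaker": ["Speaker 1", "Speaker 1", "Speaker 2", "Speaker 2", "Speaker 3"],
    "Script": ["a b", "a c", "a", "a", "b"],
})
L = ["a", "b", "c"]

# One 0/1 indicator column per keyword for each row, then summed per speaker.
out = (df.set_index("Speaker")["Script"]
         .str.findall("|".join(L))
         .str.join("|")
         .str.get_dummies()
         .groupby(level=0).sum())

# Divide each row by its own total; axis=0 aligns the row sums on the index.
pct = out.div(out.sum(axis=1), axis=0).mul(100)

# Round before casting so that a value like 33.9 is not silently truncated to 33.
labels = pct.round().astype(int).astype(str).add("%")
print(labels)
```

Keeping `pct` numeric and only building `labels` for display makes it easier to sort or filter on the percentages later.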
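Another option is str.extractall followed by a normalized value_counts per speaker:
(df.set_index('Speaker')['Script'].str.extractall(f'({"|".join(L)})')
   .groupby('Speaker')[0].value_counts(normalize=True)  # share of each match per speaker
   .unstack(fill_value=0)                                # keywords become columns
)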
Output:
0 a b c
Speaker
Speaker 1 0.5 0.25 0.25
Speaker 2 1.0 0.00 0.00
Speaker 3 0.0 1.00 0.00
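A runnable sketch of the extractall approach, again with hypothetical sample data since the original df and L are not shown. One assumption added here: the keywords are passed through `re.escape`, which only matters if L may contain regex metacharacters such as `.` or `+`.

```python
import re
import pandas as pd

# Hypothetical sample data: the real df and L are not shown in the thread.
df = pd.DataFrame({
    "Speaker": ["Speaker 1", "Speaker 1", "Speaker 2", "Speaker 2", "Speaker 3"],
    "Script": ["a b", "a c", "a", "a", "b"],
})
L = ["a", "b", "c"]

# A single capture group that matches any keyword literally.
pattern = "(" + "|".join(map(re.escape, L)) + ")"

shares = (df.set_index("Speaker")["Script"]
            .str.extractall(pattern)           # one row per match, in column 0
            .groupby(level="Speaker")[0]
            .value_counts(normalize=True)      # fraction of each keyword per speaker
            .unstack(fill_value=0))            # keywords as columns, missing ones as 0
print(shares)
```

Each row of `shares` sums to 1, so multiplying by 100 gives the same percentages as the other answers.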
Starting from your original dataframe, if you want percentages rather than the grouped sum of dummies, you can replace the whole script with the following:
m = df.set_index('Speaker')['Script'].str.findall('|'.join(L))  # list of matches per row
m = m.explode().reset_index()  # one row per match; reset_index gives a DataFrame with Speaker and Script columns
final = pd.crosstab(m['Speaker'], m['Script'], normalize='index').mul(100)  # row-wise percentage table
Script a b c
Speaker
Speaker 1 50.0 25.0 25.0
Speaker 2 100.0 0.0 0.0
Speaker 3 0.0 100.0 0.0
If you don't want percentages, just use the raw counts:
pd.crosstab(m['Speaker'],m['Script'])
Script a b c
Speaker
Speaker 1 2 1 1
Speaker 2 2 0 0
Speaker 3 0 1 0
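Here is a self-contained sketch of the crosstab version that produces both tables. The sample data is hypothetical (the original df and L are not shown). It deliberately includes a row with a repeated keyword, because findall plus explode counts every occurrence, whereas str.get_dummies only records whether a keyword is present in a row.

```python
import pandas as pd

# Hypothetical sample data; "a a" contains the same keyword twice in one row.
df = pd.DataFrame({
    "Speaker": ["Speaker 1", "Speaker 1", "Speaker 2", "Speaker 3"],
    "Script": ["a b", "a c", "a a", "b"],
})
L = ["a", "b", "c"]

# List of matches per row, then one row per match.
m = df.set_index("Speaker")["Script"].str.findall("|".join(L))
m = m.explode().reset_index()

# Raw counts and row-wise percentages from the same long table.
counts = pd.crosstab(m["Speaker"], m["Script"])
percent = pd.crosstab(m["Speaker"], m["Script"], normalize="index").mul(100)

print(counts)
print(percent)
```

With this data, Speaker 2 gets a count of 2 for `a` even though there is only one row, which is the behaviour you want if you are counting occurrences rather than rows.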
Note: this requires pandas 0.25 or later, since Series.explode was added in that version.