Split a pipe-delimited series, groupby a separate series, and return the counts of each split value in new columns

You can get there with some simple reshaping and aggregation. Note that normalize=True returns each genre's share within a year rather than a raw count:

(df.assign(genre=df['genre'].str.split('|'))
   .explode('genre')
   .groupby('year')['genre']
   .value_counts(normalize=True)
   .unstack(fill_value=0))     
 
genre       Bio     Drama   Mystery   Romance       Spy  Thriller
year                                                             
1960   0.166667  0.166667  0.166667  0.166667  0.166667  0.166667
1961   0.000000  0.500000  0.000000  0.333333  0.166667  0.000000
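If raw counts are wanted rather than shares, the same chain should work with normalize=True dropped. The sketch below is self-contained; the sample frame is an assumption, reconstructed from the outputs shown in this answer.

```python
import pandas as pd

# Sample frame reconstructed from the outputs in this answer (an assumption).
df = pd.DataFrame({
    "year": [1960, 1960, 1961, 1961, 1961],
    "genre": [
        "Drama|Romance|Thriller",
        "Spy|Mystery|Bio",
        "Drama|Romance",
        "Drama|Romance",
        "Drama|Spy",
    ],
})

counts = (df.assign(genre=df["genre"].str.split("|"))
            .explode("genre")
            .groupby("year")["genre"]
            .value_counts()            # no normalize=True, so raw counts
            .unstack(fill_value=0))    # genres that never occur in a year become 0

print(counts)
```

Either table, counts or shares, can be passed to the plotting step.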

From here, you can finish by drawing an area plot:

(df.assign(genre=df['genre'].str.split('|'))
   .explode('genre')
   .groupby('year')['genre']
   .value_counts(normalize=True)
   .unstack(fill_value=0)
   .plot
   .area())

How It Works

Start by splitting each genre string into a list and exploding it, so that every genre gets its own row:

df.assign(genre=df['genre'].str.split('|')).explode('genre') 

   year     genre
0  1960     Drama
0  1960   Romance
0  1960  Thriller
1  1960       Spy
1  1960   Mystery
1  1960       Bio
2  1961     Drama
2  1961   Romance
3  1961     Drama
3  1961   Romance
4  1961     Drama
4  1961       Spy
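To try this step on its own, here is a minimal sketch. The frame is an assumption reconstructed from the output above, and the names long and long_reset are only illustrative. It also shows that explode repeats the original index labels, which is why 0, 1, 2, ... appear several times in the output.

```python
import pandas as pd

# Sample frame reconstructed from the output above (an assumption).
df = pd.DataFrame({
    "year": [1960, 1960, 1961, 1961, 1961],
    "genre": [
        "Drama|Romance|Thriller",
        "Spy|Mystery|Bio",
        "Drama|Romance",
        "Drama|Romance",
        "Drama|Spy",
    ],
})

# Split each string into a list, then give every list element its own row.
long = df.assign(genre=df["genre"].str.split("|")).explode("genre")

# The original row labels are repeated, one copy per genre.
print(long.index.tolist())

# If unique labels are needed later, renumber the rows.
long_reset = long.reset_index(drop=True)
print(long_reset)
```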

Next, group by year and get the normalized count, that is, each genre's share of that year's entries (in an interactive session, _ holds the previous result):

_.groupby('year')['genre'].value_counts(normalize=True)

year  genre   
1960  Bio         0.166667
      Drama       0.166667
      Mystery     0.166667
      Romance     0.166667
      Spy         0.166667
      Thriller    0.166667
1961  Drama       0.500000
      Romance     0.333333
      Spy         0.166667
Name: genre, dtype: float64

Then unstack the genre level into columns, filling missing year/genre combinations with 0:

_.unstack(fill_value=0)

genre       Bio     Drama   Mystery   Romance       Spy  Thriller
year                                                             
1960   0.166667  0.166667  0.166667  0.166667  0.166667  0.166667
1961   0.000000  0.500000  0.000000  0.333333  0.166667  0.000000
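As a cross-check, the same wide table can probably be built without explode, using Series.str.get_dummies. This is a swapped-in alternative, not the method used in this answer, and the sample frame is again an assumption reconstructed from the outputs above.

```python
import pandas as pd

# Sample frame reconstructed from the outputs above (an assumption).
df = pd.DataFrame({
    "year": [1960, 1960, 1961, 1961, 1961],
    "genre": [
        "Drama|Romance|Thriller",
        "Spy|Mystery|Bio",
        "Drama|Romance",
        "Drama|Romance",
        "Drama|Spy",
    ],
})

# One 0/1 indicator column per genre, one row per original row.
# A genre repeated inside a single string would still count once for that row.
dummies = df["genre"].str.get_dummies(sep="|")

# Sum the indicators within each year to get counts per genre.
counts = dummies.groupby(df["year"]).sum()

# Divide each row by its total to turn counts into per-year shares.
shares = counts.div(counts.sum(axis=1), axis=0)

print(shares)
```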

Finally, plot the result as an area chart:

_.plot.area()

Alternatively, you could rearrange the raw data in plain Python before handing it to pandas:

import pandas as pd
from itertools import groupby
from collections import defaultdict

data = """
1960  Drama|Romance|Thriller
1960         Spy|Mystery|Bio
1961           Drama|Romance
1961           Drama|Romance
1961               Drama|Spy
"""

# sort the rows by year first, because itertools.groupby only groups adjacent items
rows = sorted((line.split() for line in data.split("\n") if line), key=lambda x: x[0])

# group by year and count every genre that appears in that year
result = {}
for key, values in groupby(rows, key=lambda x: x[0]):
    dct = defaultdict(int)
    for row in values:
        for genre in row[1].split("|"):
            dct[genre] += 1
    result[key] = dct

# feed it all to pandas
df = pd.DataFrame.from_dict(result, orient='index').fillna(0)

print(df)

This yields:

      Drama  Romance  Thriller  Spy  Mystery  Bio
1960      1        1       1.0    1      1.0  1.0
1961      3        2       0.0    1      0.0  0.0
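Some columns above come out as floats because the missing year/genre pairs are NaN before fillna(0) is applied. A minimal sketch of a variant that keeps everything as integers is shown below; it swaps groupby and defaultdict for collections.Counter and adds astype(int), both of which are choices made for this sketch rather than part of the answer above.

```python
from collections import Counter

import pandas as pd

data = """
1960  Drama|Romance|Thriller
1960         Spy|Mystery|Bio
1961           Drama|Romance
1961           Drama|Romance
1961               Drama|Spy
"""

# Accumulate one Counter per year; no sorting is needed with this approach.
counts = {}
for line in data.strip().splitlines():
    year, genres = line.split()
    counts.setdefault(year, Counter()).update(genres.split("|"))

# Missing year/genre pairs become NaN, so fill them and cast back to int.
df = pd.DataFrame.from_dict(counts, orient="index").fillna(0).astype(int)

print(df)
```

Note that the index holds the years as strings here, since they are read from text.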

Tags: Python, Pandas, Matplotlib