Cross tabulate counts between pairs of keywords per group with pandas

Use crosstab and dot. You can then use np.triu to retain only the upper half of the matrix (everything else is set to 0).

u = pd.crosstab(df.article_id, df.keyword)
v = u.T.dot(u)
pd.DataFrame(np.triu(v, k=1), index=v.index.values, columns=v.columns.values)

   A  B  C  D  E  F
A  0  2  1  1  0  0
B  0  0  1  1  0  0
C  0  0  0  0  0  0
D  0  0  0  0  1  1
E  0  0  0  0  0  1
F  0  0  0  0  0  0

Alternatively, for the last step, you can set invalid values to "-1", as a better alternative to "-" for invalid values.

v.values[np.tril_indices_from(v)] = -1
print(v)

keyword  A  B  C  D  E  F
keyword                  
A       -1  2  1  1  0  0
B       -1 -1  1  1  0  0
C       -1 -1 -1  0  0  0
D       -1 -1 -1 -1  1  1
E       -1 -1 -1 -1 -1  1
F       -1 -1 -1 -1 -1 -1

You can also do it either using merge and crosstab

df_merge = df.merge(df, on='article_id')
pd.crosstab(df_merge['keyword_x'], df_merge['keyword_y'])

or merge and pivot_table

df_merge = df.merge(df, on='article_id')
df_merge.pivot_table('article_id', 'keyword_x', 'keyword_y', 'count', 0)

both resulting in

keyword_y  A  B  C  D  E  F
keyword_x                  
A          2  2  1  1  0  0
B          2  2  1  1  0  0
C          1  1  1  0  0  0
D          1  1  0  2  1  1
E          0  0  0  1  1  1
F          0  0  0  1  1  1

You can use product over groups and use for loops to increment the count i.e

from itertools import product 

df2 = pd.DataFrame(columns=df['keyword'].unique(),index=df['keyword'].unique()).fillna(0)


for i in df.groupby('article_id')['keyword'].apply(lambda x : product(x,x)).values:
    for k,l in i:
        if k==l:
            df2.loc[k,l]='-'
        elif df2.loc[k,l]!=0:
            df2.loc[k,l]+=1
        else:
            df2.loc[k,l]=1

df2 = df2.where((df2=='-').cumsum().T.astype(bool),'-')

   A  B  C  D  E  F
A  -  2  1  1  0  0
B  -  -  1  1  0  0
C  -  -  -  0  0  0
D  -  -  -  -  1  1
E  -  -  -  -  -  1
F  -  -  -  -  -  -

Cross tabulate counts between pairs of keywords per group with pandas

Tags:

Python

Pandas

Group By

Dataframe

Pandas Groupby

Related

Recent Posts