Create an NxN matrix from one column in pandas
If your data is not too big, you can use get_dummies to encode the values and do a matrix multiplication:
s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
s.dot(s.T).div(s.sum(axis=1))
Output:
          0         1         2         3
0  1.000000  0.666667  1.000000  1.000000
1  0.666667  1.000000  0.666667  0.666667
2  1.000000  0.666667  1.000000  1.000000
3  1.000000  0.666667  1.000000  1.000000
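For anyone who wants to run this end to end, here is a minimal sketch. The sample `df` is an assumption, reconstructed from the one-hot table in this answer rather than taken from the question, and `dtype=int` is passed to get_dummies only so the counts are plain integers on every pandas version.

```python
import pandas as pd

# Hypothetical sample data (assumed, not from the original question).
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "list_of_value": [
        ["a", "b", "c"],
        ["d", "b", "c"],
        ["a", "b", "c"],
        ["a", "b", "c"],
    ],
})

# One row per (original row, value), one-hot encoded,
# then collapsed back to one row per original index.
s = pd.get_dummies(df["list_of_value"].explode(), dtype=int).groupby(level=0).sum()

# Pairwise intersection sizes, then each column j divided by the size of list j.
sim = s.dot(s.T).div(s.sum(axis=1))
print(sim.round(2))
```

All four lists have three elements here, so the matrix happens to be symmetric; with lists of different lengths it generally is not.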
Update: Here's a short explanation of the code. The main idea is to turn the given lists into a one-hot-encoded dataframe:
   a  b  c  d
0  1  1  1  0
1  0  1  1  1
2  1  1  1  0
3  1  1  1  0
Once we have that, the size of the intersection of two rows, say 0 and 1, is just their dot product, because a character belongs to both rows if and only if it is represented by 1 in both.
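The dot-product claim can be checked by hand with plain Python. This is only an illustrative sketch: `vocab`, `one_hot`, and the two sets are hypothetical names standing in for rows 0 and 1 of the table above.

```python
# Rows 0 and 1 from the one-hot table, written as sets (assumed sample data).
vocab = ["a", "b", "c", "d"]
row0 = {"a", "b", "c"}
row1 = {"b", "c", "d"}

def one_hot(values, vocab):
    """Return a 0/1 vector with a 1 for every vocab entry present in values."""
    return [1 if v in values else 0 for v in vocab]

v0 = one_hot(row0, vocab)  # [1, 1, 1, 0]
v1 = one_hot(row1, vocab)  # [0, 1, 1, 1]

# A position contributes 1 to the dot product only when both vectors have a 1 there.
dot = sum(x * y for x, y in zip(v0, v1))
print(dot, len(row0 & row1))  # both are 2
```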
With that in mind, first use df.list_of_value.explode() to turn each cell into a series and concatenate all of those series. Output:
0 a
0 b
0 c
1 d
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
Name: list_of_value, dtype: object
Now, we use pd.get_dummies on that series to turn it into a one-hot-encoded dataframe:
   a  b  c  d
0  1  0  0  0
0  0  1  0  0
0  0  0  1  0
1  0  0  0  1
1  0  1  0  0
1  0  0  1  0
2  1  0  0  0
2  0  1  0  0
2  0  0  1  0
3  1  0  0  0
3  0  1  0  0
3  0  0  1  0
As you can see, each value has its own row. Since we want to combine the rows that belong to the same original row into one, we can just sum them by the original index. Thus
s = pd.get_dummies(df.list_of_value.explode()).groupby(level=0).sum()
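gives the binary-encoded dataframe we want. The next line
s.dot(s.T).div(s.sum(axis=1))
follows your logic: s.dot(s.T) computes the dot product of every pair of rows, which is the size of each pairwise intersection. Then .div(s.sum(axis=1)) divides those counts by the list sizes. Note that .div aligns the sizes with the column labels by default, so column j is divided by the size of list j.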
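When the lists have different lengths, the direction of the division matters. The sketch below uses two hypothetical lists of unequal size to show the default behaviour of .div next to the axis=0 variant; which of the two you want depends on whether the denominator should be the list in the column or the list in the row.

```python
import pandas as pd

# Hypothetical data with lists of different sizes (2 and 4 elements).
df = pd.DataFrame({"list_of_value": [["a", "b"], ["a", "b", "c", "d"]]})

s = pd.get_dummies(df["list_of_value"].explode(), dtype=int).groupby(level=0).sum()
overlap = s.dot(s.T)    # pairwise intersection sizes: [[2, 2], [2, 4]]
sizes = s.sum(axis=1)   # list sizes: [2, 4]

# Default: sizes align with column labels, so entry (i, j) = overlap / size of list j.
by_column = overlap.div(sizes)

# axis=0: sizes align with the row index, so entry (i, j) = overlap / size of list i.
by_row = overlap.div(sizes, axis=0)

print(by_column)
print(by_row)
```

The two results are transposes of each other, so switching convention is a matter of adding axis=0 or calling .T on the result.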
Try this:
# Assumption: ids is the list of ids from the question, e.g. ids = df['id'].tolist()
range_of_ids = range(len(ids))

def score_calculation(s_id1, s_id2):
    # Look up the two lists by id and convert them to sets
    s1 = set(list(df.loc[df['id'] == ids[s_id1]]['list_of_value'])[0])
    s2 = set(list(df.loc[df['id'] == ids[s_id2]]['list_of_value'])[0])
    # Share of s1 that also appears in s2
    return round(len(s1 & s2) / len(s1), 2)

dic = {indexQFID: [score_calculation(indexQFID, ind) for ind in range_of_ids] for indexQFID in range_of_ids}
dfSim = pd.DataFrame(dic)
print(dfSim)
Output
      0     1     2     3
0  1.00  0.67  1.00  1.00
1  0.67  1.00  0.67  0.67
2  1.00  0.67  1.00  1.00
3  1.00  0.67  1.00  1.00
You can also do it as follows:
dic = {
    indexQFID: [round(len(set(s1) & set(s2)) / len(s1), 2) for s2 in df['list_of_value']]
    for indexQFID, s1 in zip(df['id'], df['list_of_value'])
}
dfSim = pd.DataFrame(dic)
print(dfSim)
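If you want to convince yourself that the set-based loop and the get_dummies approach from the other answer agree, a quick cross-check could look like the sketch below. The data is hypothetical (lists of different sizes, ids equal to the row index), rounding is left out so the raw values can be compared, and `loop_sim`, `matrix_sim`, and `agree` are illustrative names.

```python
import pandas as pd

# Hypothetical data; ids are assumed to match the default row index 0..2.
df = pd.DataFrame({
    "id": [0, 1, 2],
    "list_of_value": [["a", "b"], ["a", "b", "c", "d"], ["c", "d", "e"]],
})

# Set-based version: the column for s1 holds |s1 & s2| / |s1| for every s2.
loop_sim = pd.DataFrame({
    row_id: [len(set(s1) & set(s2)) / len(set(s1)) for s2 in df["list_of_value"]]
    for row_id, s1 in zip(df["id"], df["list_of_value"])
})

# Matrix version: one-hot encode, take dot products, divide column j by size of list j.
s = pd.get_dummies(df["list_of_value"].explode(), dtype=int).groupby(level=0).sum()
matrix_sim = s.dot(s.T).div(s.sum(axis=1))

# Compare the two results element by element.
max_diff = abs(loop_sim.to_numpy() - matrix_sim.to_numpy()).max()
agree = bool(max_diff < 1e-9)
print(agree)
```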