Create an NxN matrix from one column in pandas

If your data is not too big, you can use get_dummies to one-hot encode the values and then do a matrix multiplication:

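# groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
s = pd.get_dummies(df.list_of_value.explode(), dtype=int).groupby(level=0).sum()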
s.dot(s.T).div(s.sum(1))

Output:

          0         1         2         3
0  1.000000  0.666667  1.000000  1.000000
1  0.666667  1.000000  0.666667  0.666667
2  1.000000  0.666667  1.000000  1.000000
3  1.000000  0.666667  1.000000  1.000000
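
For reference, here is a self-contained sketch of the same computation. The sample frame is an assumption reconstructed from the one-hot table in the explanation (the `id` values are made up), and `groupby(level=0).sum()` is used because `sum(level=0)` is no longer available in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical sample data: rows 0, 2, 3 hold {a, b, c}; row 1 holds {d, b, c}.
df = pd.DataFrame({
    "id": [10, 11, 12, 13],  # made-up ids, not from the original question
    "list_of_value": [["a", "b", "c"], ["d", "b", "c"], ["a", "b", "c"], ["a", "b", "c"]],
})

# One-hot encode the exploded values, then collapse back to one row per original index.
s = pd.get_dummies(df["list_of_value"].explode(), dtype=int).groupby(level=0).sum()

# s.dot(s.T) holds pairwise intersection sizes; div() scales column j by the size of list j.
result = s.dot(s.T).div(s.sum(axis=1))
print(result.round(2))
```

Passing `dtype=int` keeps the encoded values as 0/1 integers regardless of the pandas version's default dtype for `get_dummies`.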

Update: Here's a short explanation of the code. The main idea is to turn the given lists into a one-hot-encoded table:

   a  b  c  d
0  1  1  1  0
1  0  1  1  1
2  1  1  1  0
3  1  1  1  0

Once we have that, the size of the intersection of two rows, say 0 and 1, is just their dot product, because a character belongs to both rows if and only if it is represented by 1 in both.
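
A minimal check of that claim, using rows 0 and 1 of the one-hot table above (the variable names are only for illustration):

```python
import numpy as np

# One-hot rows over the columns (a, b, c, d), copied from the table above.
row0 = np.array([1, 1, 1, 0])  # {a, b, c}
row1 = np.array([0, 1, 1, 1])  # {b, c, d}

# A position contributes 1 to the dot product only when both rows have a 1 there.
overlap = int(row0 @ row1)
set_overlap = len({"a", "b", "c"} & {"b", "c", "d"})
print(overlap, set_overlap)  # 2 2
```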

With that in mind, first use

df.list_of_value.explode()

to put each list element on its own row while keeping the index of the row it came from. Output:

0    a
0    b
0    c
1    d
1    b
1    c
2    a
2    b
2    c
3    a
3    b
3    c
Name: list_of_value, dtype: object

Now we use pd.get_dummies on that series to turn it into a one-hot-encoded dataframe:

   a  b  c  d
0  1  0  0  0
0  0  1  0  0
0  0  0  1  0
1  0  0  0  1
1  0  1  0  0
1  0  0  1  0
2  1  0  0  0
2  0  1  0  0
2  0  0  1  0
3  1  0  0  0
3  0  1  0  0
3  0  0  1  0

As you can see, each value has its own row. Since we want to combine the rows that belong to the same original row into one, we can just sum them by the original index. Thus

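# groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0
s = pd.get_dummies(df.list_of_value.explode(), dtype=int).groupby(level=0).sum()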

gives the binary-encoded dataframe we want. The next line

s.dot(s.T).div(s.sum(1))

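follows your logic: s.dot(s.T) computes the dot product of every pair of rows, which gives the intersection sizes, and .div(s.sum(1)) divides those counts by the number of values in each list. The series is aligned with the columns, so column j is divided by the size of list j.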
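
Because the sample lists all have three elements, the output is symmetric and hides which size ends up in the denominator. A small sketch with unequal sizes (hypothetical data) shows the difference between the default column alignment of `div` and `axis=0`:

```python
import pandas as pd

# Hypothetical one-hot frame: list 0 = {a, b}, list 1 = {a, b, c, d}.
s = pd.DataFrame([[1, 1, 0, 0], [1, 1, 1, 1]], columns=list("abcd"))

inter = s.dot(s.T)     # pairwise intersection sizes
sizes = s.sum(axis=1)  # size of each list: 2 and 4

by_column = inter.div(sizes)       # default: entry (i, j) / size of list j
by_row = inter.div(sizes, axis=0)  # entry (i, j) / size of list i

print(by_column)
print(by_row)
```

Which of the two is wanted depends on whether the score should be normalised by the list in the row or the list in the column.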

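Try this:

ids = df['id'].tolist()  # assumption: ids holds the values of the id column
range_of_ids = range(len(ids))

def score_calculation(s_id1, s_id2):
    # Look up the two lists by id and convert them to sets
    s1 = set(df.loc[df['id'] == ids[s_id1], 'list_of_value'].iloc[0])
    s2 = set(df.loc[df['id'] == ids[s_id2], 'list_of_value'].iloc[0])
    # Share of s1's elements that also appear in s2
    return round(len(s1 & s2) / len(s1), 2)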


dic = {indexQFID:  [score_calculation(indexQFID,ind) for ind in range_of_ids] for indexQFID in range_of_ids}
dfSim = pd.DataFrame(dic)
print(dfSim)

Output:

      0     1     2     3
0  1.00  0.67  1.00  1.00
1  0.67  1.00  0.67  0.67
2  1.00  0.67  1.00  1.00
3  1.00  0.67  1.00  1.00

You can also do it as follows, without the helper function:

dic = {indexQFID:  [round(len(set(s1)&set(s2))/len(s1) , 2) for s2 in df['list_of_value']] for indexQFID,s1 in zip(df['id'],df['list_of_value']) }
dfSim = pd.DataFrame(dic)
print(dfSim)
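
A self-contained sketch of this comprehension, for anyone who wants to run it directly. The frame is hypothetical sample data (the `id` values are made up), and each list is converted to a set once up front, so the denominator is the number of distinct values rather than the raw list length; the two only differ if a list contains duplicates.

```python
import pandas as pd

# Hypothetical frame with the two columns the snippets above rely on.
df = pd.DataFrame({
    "id": [10, 11, 12, 13],  # made-up ids
    "list_of_value": [list("abc"), list("dbc"), list("abc"), list("abc")],
})

sets = [set(values) for values in df["list_of_value"]]  # convert each list once

# Column key = id of s1; each column lists |s1 & s2| / |s1| for every s2.
dic = {
    key: [round(len(s1 & s2) / len(s1), 2) for s2 in sets]
    for key, s1 in zip(df["id"], sets)
}
dfSim = pd.DataFrame(dic)
print(dfSim)
```

Note that the columns of this result are labelled with the `id` values, while the rows keep the default positional index.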


Tags:

Python

Pandas

Numpy
