Comparing lists in two columns row-wise efficiently

I suggest calculating additions and removals within the same apply.

Generate a bigger example

import pandas as pd
import numpy as np
df = pd.DataFrame({'today': [['a', 'b', 'c'], ['a', 'b'], ['b']], 
                   'yesterday': [['a', 'b'], ['a'], ['a']]})
df = pd.concat([df for i in range(10_000)], ignore_index=True)

Your solution

%%time
additions = df.apply(lambda row: np.setdiff1d(row.today, row.yesterday), axis=1)
removals  = df.apply(lambda row: np.setdiff1d(row.yesterday, row.today), axis=1)

CPU times: user 10.9 s, sys: 29.8 ms, total: 11 s
Wall time: 11 s

Your solution on a single apply

%%time
df["out"] = df.apply(lambda row: [np.setdiff1d(row.today, row.yesterday),
                                  np.setdiff1d(row.yesterday, row.today)], axis=1)
df[['additions','removals']] = pd.DataFrame(df['out'].values.tolist(), 
                                            columns=['additions','removals'])
df = df.drop("out", axis=1)

CPU times: user 4.97 s, sys: 16 ms, total: 4.99 s
Wall time: 4.99 s

Using `set`

Unless your lists are very big, you can avoid numpy and use Python's built-in set difference instead.

def fun(x):
    a = list(set(x["today"]).difference(set(x["yesterday"])))
    b = list((set(x["yesterday"])).difference(set(x["today"])))
    return [a,b]

%%time
df["out"] = df.apply(fun, axis=1)
df[['additions','removals']] = pd.DataFrame(df['out'].values.tolist(), 
                                            columns=['additions','removals'])
df = df.drop("out", axis=1)

CPU times: user 1.56 s, sys: 0 ns, total: 1.56 s
Wall time: 1.56 s
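
The same set-difference idea can also be written without `df.apply(..., axis=1)`, by zipping the two columns in a plain list comprehension. This is only a sketch of an alternative, not something timed above, so measure it on your own data; the `pairs` name is just illustrative, and `sorted` is used here only to make the element order deterministic.

```python
import pandas as pd

df = pd.DataFrame({'today': [['a', 'b', 'c'], ['a', 'b'], ['b']],
                   'yesterday': [['a', 'b'], ['a'], ['a']]})

# Build each row's pair of sets once, so both differences reuse them
pairs = [(set(t), set(y)) for t, y in zip(df['today'], df['yesterday'])]

# Wrap the results in a Series aligned on df.index so each cell holds one list
df['additions'] = pd.Series([sorted(t - y) for t, y in pairs], index=df.index)
df['removals'] = pd.Series([sorted(y - t) for t, y in pairs], index=df.index)
```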

@r.ook's solution

If you're happy having sets instead of lists as output, you can use @r.ook's code:

%%time
temp = df[['today', 'yesterday']].applymap(set)
removals = temp.diff(periods=1, axis=1).dropna(axis=1)
additions = temp.diff(periods=-1, axis=1).dropna(axis=1) 

CPU times: user 93.1 ms, sys: 12 ms, total: 105 ms
Wall time: 104 ms

@Andreas K.'s solution

%%time
df['additions'] = (df['today'].apply(set) - df['yesterday'].apply(set))
df['removals'] = (df['yesterday'].apply(set) - df['today'].apply(set))

CPU times: user 161 ms, sys: 28.1 ms, total: 189 ms
Wall time: 187 ms

You can then add .apply(list) to get the same list output as before.
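
As a minimal sketch of that last step, on the small three-row frame rather than the 30,000-row one: the set columns are computed once and the differences are converted back to lists. `sorted` is used here in place of `list` as an assumption that a deterministic element order is wanted; `.apply(list)` works the same way if order does not matter.

```python
import pandas as pd

df = pd.DataFrame({'today': [['a', 'b', 'c'], ['a', 'b'], ['b']],
                   'yesterday': [['a', 'b'], ['a'], ['a']]})

# Convert each column to sets once and reuse them for both differences
today_sets = df['today'].apply(set)
yesterday_sets = df['yesterday'].apply(set)

# Element-wise set difference, then back to (sorted) lists
df['additions'] = (today_sets - yesterday_sets).apply(sorted)
df['removals'] = (yesterday_sets - today_sets).apply(sorted)
```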

I'm not sure about performance, but for lack of a better solution this might work:

temp = df[['today', 'yesterday']].applymap(set)
removals = temp.diff(periods=1, axis=1).dropna(axis=1)
additions = temp.diff(periods=-1, axis=1).dropna(axis=1)

Removals:

  yesterday
0        {}
1        {}
2       {a}

Additions:

  today
0   {c}
1   {b}
2   {b}

df['today'].apply(set) - df['yesterday'].apply(set)
