Adding df column finding matching values in another df for both indexed values and a dynamic source column?
You can set ID_number
as index in df2
,then use pd.Index.get_indexer
here.
df2 = df2.set_index('ID_number')
r = df2.index.get_indexer(df['ID'])
c = df2.columns.get_indexer(df['value'])
df['new_col'] = df2.values[r, c]
df
ID to ignore value new_col
0 6 foo A 0.88
1 2 whatever B 0.22
2 4 idk A 0.84
Timeits
Benchmarked using the below setup:
Tested on Ubuntu 20.04.1 LTS(focal), Cpython3.8.5, Ipython shell(7.18.1), pandas(1.1.4), numpy(1.19.2)
Setup
df2 = pd.DataFrame(
{
"ID_number": np.arange(1, 1_000_000 + 1),
"A": np.random.rand(1_000_000),
"B": np.random.rand(1_000_000),
}
)
df = pd.DataFrame(
{
"ID": np.random.randint(1, 1_000_000, 50_000),
"to ignore": ["anything"] * 50_000,
"value": np.random.choice(["A", "B"], 50_000),
}
)
Resutls:
@Vaishali
In [57]: %%timeit
...: mapper = df2.set_index('ID_number').to_dict('index')
...: df['NewCol'] = df.apply(lambda x: mapper[x['ID']][x['value']], axis =
...: 1)
...:
...:
2.09 s ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@Ch3steR
In [58]: %%timeit
...: t = df2.set_index('ID_number')
...: r = t.index.get_indexer(df['ID'])
...: c = t.columns.get_indexer(df['value'])
...: df['new_col'] = df2.values[r, c]
...:
...:
49.7 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
@Mayank
In [59]: %%timeit
...: x = df2.set_index('ID_number').stack()
...: y = df.set_index(['ID', 'value'])
...: y['NewCol'] = y.index.to_series().map(x.to_dict())
...: y.reset_index(inplace=True)
...:
...:
3.41 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
@Jezrael
In [60]: %%timeit
...: df11 = (df2.melt('ID_number', value_name='NewCol', var_name='value')
...: .drop_duplicates(['ID_number','value'])
...: .rename(columns={'ID_number':'ID'}))
...: df.merge(df11, on=['ID','value'], how='left')
...:
...:
693 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use DataFrame.melt
with DataFrame.merge
, advantage is this solution working well if no match:
df11 = (df2.melt('ID_number', value_name='NewCol', var_name='value')
.drop_duplicates(['ID_number','value'])
.rename(columns={'ID_number':'ID'}))
df = df.merge(df11, on=['ID','value'], how='left')
print (df)
ID to ignore value NewCol
0 6 foo A 0.88
1 2 whatever B 0.22
2 4 idk A 0.84