Python/Pandas: How to Match List of Strings with a DataFrame column
Here is one solution, which splits text into lower-case sets and uses sets intersection for each row:
In [160]: x['Match'] = x.Description.str.lower().str.split().map(set).to_frame('desc') \
...: .apply(lambda r: (x.Employer.str.lower().str.split().map(set) & r.desc).any(),
...: axis=1)
...:
In [161]: x
Out[161]:
Date Description Amount AutoNumber Employer Match
0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246 Cansel Survey Equipment False
2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246 Cansel Survey Equipment False
5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment True
9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246 Cansel Survey Equipment False
10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246 Cansel Survey Equipment False
11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246 Cansel Survey Equipment True
12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False
13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246 Cansel Survey Equipment False
14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False
15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246 Cansel Survey Equipment True
17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246 Cansel Survey Equipment False
19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246 Cansel Survey Equipment False
20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246 Cansel Survey Equipment False
21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False
22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246 Cansel Survey Equipment False
23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment True
24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246 Cansel Survey Equipment False
26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246 Cansel Survey Equipment False
27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246 Cansel Survey Equipment False
28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246 Cansel Survey Equipment False
PS it's pretty slow as it's using not vectorized .apply(..., axis=1)
method
Here is fast and memory-saving vectrorized solution, which uses sklearn.feature_extraction.text.CountVectorizer method:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=1, lowercase=True)
X = vect.fit_transform(df['Employer'])
cols_emp = vect.get_feature_names()
X = vect.fit_transform(df['Description'])
cols_desc = vect.get_feature_names()
common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
df['Match'] = (X.toarray()[:, common_cols_idx] == 1).any(1)
Source DF:
In [259]: df
Out[259]:
Date Description Amount AutoNumber Employer
0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246 Cansel Survey Equipment
2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246 Cansel Survey Equipment
5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment
9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246 Cansel Survey Equipment
10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246 Cansel Survey Equipment
11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246 Cansel Survey Equipment
12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment
13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246 Cansel Survey Equipment
14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment
15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246 Cansel Survey Equipment
17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246 Cansel Survey Equipment
19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246 Cansel Survey Equipment
20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246 Cansel Survey Equipment
21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment
22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246 Cansel Survey Equipment
23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment
24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246 Cansel IR453
26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246 Cansel Survey Equipment
27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246 Cansel Survey Equipment
28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246 Cansel Survey HU521
Result:
In [261]: df
Out[261]:
Date Description Amount AutoNumber Employer Match
0 3/17/2015 WW120 TFR?FR xxx8690 140.00 49246 Cansel Survey Equipment False
2 3/13/2015 JX154 TFR?FR xxx8690 150.00 49246 Cansel Survey Equipment False
5 3/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment True
9 3/2/2015 UE200 TFR?FR xxx8690 180.00 49246 Cansel Survey Equipment False
10 2/27/2015 JH401 TFR?FR xxx8690 400.00 49246 Cansel Survey Equipment False
11 2/27/2015 CANSEL SURVEY E PAY 555.62 49246 Cansel Survey Equipment True
12 2/25/2015 HU204 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False
13 2/23/2015 UQ263 TFR?FR xxx8690 102.00 49246 Cansel Survey Equipment False
14 2/23/2015 UT460 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False
15 2/20/2015 CANSEL SURVEY E PAY 1222.05 49246 Cansel Survey Equipment True
17 2/17/2015 UO414 TFR?FR xxx8690 250.00 49246 Cansel Survey Equipment False
19 2/11/2015 HI540 TFR?FR xxx8690 130.00 49246 Cansel Survey Equipment False
20 2/11/2015 HQ010 TFR?FR xxx8690 177.00 49246 Cansel Survey Equipment False
21 2/10/2015 WU455 TFR?FR xxx8690 200.00 49246 Cansel Survey Equipment False
22 2/6/2015 JJ500 TFR?FR xxx8690 301.00 49246 Cansel Survey Equipment False
23 2/6/2015 CANSEL SURVEY E PAY 1182.08 49246 Cansel Survey Equipment True
24 2/5/2015 IR453 TFR?FR xxx8690 168.56 49246 Cansel IR453 True
26 2/2/2015 RQ574 TFR?FR xxx8690 500.00 49246 Cansel Survey Equipment False
27 2/2/2015 UT022 TFR?FR xxx8690 850.00 49246 Cansel Survey Equipment False
28 12/31/2014 HU521 TFR?FR xxx8690 950.17 49246 Cansel Survey HU521 True
Some explanations:
In [266]: cols_desc
Out[266]:
['cansel',
'fr',
'hi540',
'hq010',
'hu204',
'hu521',
'ir453',
'jh401',
'jj500',
'jx154',
'pay',
'rq574',
'survey',
'tfr',
'ue200',
'uo414',
'uq263',
'ut022',
'ut460',
'wu455',
'ww120',
'xxx8690']
In [267]: cols_emp
Out[267]: ['cansel', 'equipment', 'hu521', 'ir453', 'survey']
In [268]: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
In [269]: common_cols_idx
Out[269]: [0, 5, 6, 12]
In [270]: X.toarray()
Out[270]:
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
[0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int64)
In [271]: X.toarray()[:, common_cols_idx]
Out[271]:
array([[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 0, 0, 1],
[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 0, 0, 1],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 0, 0, 1],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[1, 0, 0, 1],
[0, 0, 1, 0],
[0, 0, 0, 0],
[0, 0, 0, 0],
[0, 1, 0, 0]], dtype=int64)
In [272]: X.toarray()[:, common_cols_idx] == 1
Out[272]:
array([[False, False, False, False],
[False, False, False, False],
[ True, False, False, True],
[False, False, False, False],
[False, False, False, False],
[ True, False, False, True],
[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
[ True, False, False, True],
[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
[False, False, False, False],
[ True, False, False, True],
[False, False, True, False],
[False, False, False, False],
[False, False, False, False],
[False, True, False, False]], dtype=bool)
In [273]: (X.toarray()[:, common_cols_idx] == 1).any(1)
Out[273]: array([False, False, True, False, False, True, False, False, False, True, False, False, False, False, False, True, True, Fals
e, False, True], dtype=bool)
Here is a readable solution using an individual search_func
:
def search_func(row):
matches = [test_value in row["Description"].lower()
for test_value in row["Text_Search"]]
if any(matches):
return "Yes"
else:
return "No"
This function is then applied row-wise:
# create example data
df = pd.DataFrame({"Description": ["CANSEL SURVEY E PAY", "JX154 TFR?FR xxx8690"],
"Employer": ["Cansel Survey Equipment", "Cansel Survey Equipment"]})
print(df)
Description Employer
0 CANSEL SURVEY E PAY Cansel Survey Equipment
1 JX154 TFR?FR xxx8690 Cansel Survey Equipment
# create text searches and match column
df["Text_Search"] = df["Employer"].str.lower().str.split()
df["Match"] = df.apply(search_func, axis=1)
# show result
print(df)
Description Employer Text_Search Match
0 CANSEL SURVEY E PAY Cansel Survey Equipment [cansel, survey, equipment] Yes
1 JX154 TFR?FR xxx8690 Cansel Survey Equipment [cansel, survey, equipment] No
Timing comparison for different solutions
Let's prepare a bit bigger DF - 2.000 rows:
In [3]: df = pd.concat([df] * 10**2, ignore_index=True)
In [4]: df.shape
Out[4]: (2000, 5)
Solution 1: df.apply(..., axis=1)
:
df["Text_Search"] = df.Employer.str.lower().str.split().map(set)
In [15]: %%timeit
...: df.Description.str.lower().str.split().map(set).to_frame('desc') \
...: .apply(lambda r: (df["Text_Search"] & r.desc).any(),
...: axis=1)
...:
1 loop, best of 3: 5.06 s per loop
Solution 2: CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=1, lowercase=True)
In [8]: %%timeit
...: X = vect.fit_transform(df['Employer'])
...: cols_emp = vect.get_feature_names()
...: X = vect.fit_transform(df['Description'])
...: cols_desc = vect.get_feature_names()
...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
...: (X.toarray()[:, common_cols_idx] == 1).any(1)
...:
10 loops, best of 3: 88.2 ms per loop
Solution 3: df.apply(search_func, axis=1)
df["Text_Search"] = df["Employer"].str.lower().str.split()
In [12]: %%timeit
...: df.apply(search_func, axis=1)
...:
1 loop, best of 3: 362 ms per loop
NOTE: Solution 1
is too slow, so i will not "timeit" this solution for bigger DFs
Comparison of df.apply(search_func, axis=1)
and CountVectorizer
for 20.000 rows DF:
In [16]: df = pd.concat([df] * 10, ignore_index=True)
In [17]: df.shape
Out[17]: (20000, 6)
In [20]: %%timeit
...: df.apply(search_func, axis=1)
...:
1 loop, best of 3: 3.66 s per loop
In [21]: %%timeit
...: X = vect.fit_transform(df['Employer'])
...: cols_emp = vect.get_feature_names()
...: X = vect.fit_transform(df['Description'])
...: cols_desc = vect.get_feature_names()
...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
...: (X.toarray()[:, common_cols_idx] == 1).any(1)
...:
1 loop, best of 3: 825 ms per loop
Comparison of df.apply(search_func, axis=1)
and CountVectorizer
for 200.000 rows DF:
In [22]: df = pd.concat([df] * 10, ignore_index=True)
In [23]: df.shape
Out[23]: (200000, 6)
In [24]: %%timeit
...: df.apply(search_func, axis=1)
...:
1 loop, best of 3: 36.8 s per loop
In [25]: %%timeit
...: X = vect.fit_transform(df['Employer'])
...: cols_emp = vect.get_feature_names()
...: X = vect.fit_transform(df['Description'])
...: cols_desc = vect.get_feature_names()
...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
...: (X.toarray()[:, common_cols_idx] == 1).any(1)
...:
1 loop, best of 3: 8.28 s per loop
Conclusion: CountVectorized
solution is apporx. 4.44 times faster compared to df.apply(search_func, axis=1)