Python/Pandas: How to Match List of Strings with a DataFrame column

Here is one solution, which splits text into lower-case sets and uses sets intersection for each row:

In [160]: x['Match'] = x.Description.str.lower().str.split().map(set).to_frame('desc') \
     ...:               .apply(lambda r: (x.Employer.str.lower().str.split().map(set) & r.desc).any(),
     ...:                      axis=1)
     ...:

In [161]: x
Out[161]:
          Date           Description   Amount  AutoNumber                 Employer  Match
0    3/17/2015  WW120 TFR?FR xxx8690   140.00       49246  Cansel Survey Equipment  False
2    3/13/2015  JX154 TFR?FR xxx8690   150.00       49246  Cansel Survey Equipment  False
5     3/6/2015   CANSEL SURVEY E PAY  1182.08       49246  Cansel Survey Equipment   True
9     3/2/2015  UE200 TFR?FR xxx8690   180.00       49246  Cansel Survey Equipment  False
10   2/27/2015  JH401 TFR?FR xxx8690   400.00       49246  Cansel Survey Equipment  False
11   2/27/2015   CANSEL SURVEY E PAY   555.62       49246  Cansel Survey Equipment   True
12   2/25/2015  HU204 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment  False
13   2/23/2015  UQ263 TFR?FR xxx8690   102.00       49246  Cansel Survey Equipment  False
14   2/23/2015  UT460 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment  False
15   2/20/2015   CANSEL SURVEY E PAY  1222.05       49246  Cansel Survey Equipment   True
17   2/17/2015  UO414 TFR?FR xxx8690   250.00       49246  Cansel Survey Equipment  False
19   2/11/2015  HI540 TFR?FR xxx8690   130.00       49246  Cansel Survey Equipment  False
20   2/11/2015  HQ010 TFR?FR xxx8690   177.00       49246  Cansel Survey Equipment  False
21   2/10/2015  WU455 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment  False
22    2/6/2015  JJ500 TFR?FR xxx8690   301.00       49246  Cansel Survey Equipment  False
23    2/6/2015   CANSEL SURVEY E PAY  1182.08       49246  Cansel Survey Equipment   True
24    2/5/2015  IR453 TFR?FR xxx8690   168.56       49246  Cansel Survey Equipment  False
26    2/2/2015  RQ574 TFR?FR xxx8690   500.00       49246  Cansel Survey Equipment  False
27    2/2/2015  UT022 TFR?FR xxx8690   850.00       49246  Cansel Survey Equipment  False
28  12/31/2014  HU521 TFR?FR xxx8690   950.17       49246  Cansel Survey Equipment  False

PS it's pretty slow as it's using not vectorized .apply(..., axis=1) method

Here is fast and memory-saving vectrorized solution, which uses sklearn.feature_extraction.text.CountVectorizer method:

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=1, lowercase=True)

X = vect.fit_transform(df['Employer'])
cols_emp = vect.get_feature_names()

X = vect.fit_transform(df['Description'])
cols_desc = vect.get_feature_names()

common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]

df['Match'] = (X.toarray()[:, common_cols_idx] == 1).any(1)

Source DF:

In [259]: df
Out[259]:
          Date           Description   Amount  AutoNumber                 Employer
0    3/17/2015  WW120 TFR?FR xxx8690   140.00       49246  Cansel Survey Equipment
2    3/13/2015  JX154 TFR?FR xxx8690   150.00       49246  Cansel Survey Equipment
5     3/6/2015   CANSEL SURVEY E PAY  1182.08       49246  Cansel Survey Equipment
9     3/2/2015  UE200 TFR?FR xxx8690   180.00       49246  Cansel Survey Equipment
10   2/27/2015  JH401 TFR?FR xxx8690   400.00       49246  Cansel Survey Equipment
11   2/27/2015   CANSEL SURVEY E PAY   555.62       49246  Cansel Survey Equipment
12   2/25/2015  HU204 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment
13   2/23/2015  UQ263 TFR?FR xxx8690   102.00       49246  Cansel Survey Equipment
14   2/23/2015  UT460 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment
15   2/20/2015   CANSEL SURVEY E PAY  1222.05       49246  Cansel Survey Equipment
17   2/17/2015  UO414 TFR?FR xxx8690   250.00       49246  Cansel Survey Equipment
19   2/11/2015  HI540 TFR?FR xxx8690   130.00       49246  Cansel Survey Equipment
20   2/11/2015  HQ010 TFR?FR xxx8690   177.00       49246  Cansel Survey Equipment
21   2/10/2015  WU455 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment
22    2/6/2015  JJ500 TFR?FR xxx8690   301.00       49246  Cansel Survey Equipment
23    2/6/2015   CANSEL SURVEY E PAY  1182.08       49246  Cansel Survey Equipment
24    2/5/2015  IR453 TFR?FR xxx8690   168.56       49246             Cansel IR453
26    2/2/2015  RQ574 TFR?FR xxx8690   500.00       49246  Cansel Survey Equipment
27    2/2/2015  UT022 TFR?FR xxx8690   850.00       49246  Cansel Survey Equipment
28  12/31/2014  HU521 TFR?FR xxx8690   950.17       49246      Cansel Survey HU521

Result:

In [261]: df
Out[261]:
          Date           Description   Amount  AutoNumber                 Employer  Match
0    3/17/2015  WW120 TFR?FR xxx8690   140.00       49246  Cansel Survey Equipment  False
2    3/13/2015  JX154 TFR?FR xxx8690   150.00       49246  Cansel Survey Equipment  False
5     3/6/2015   CANSEL SURVEY E PAY  1182.08       49246  Cansel Survey Equipment   True
9     3/2/2015  UE200 TFR?FR xxx8690   180.00       49246  Cansel Survey Equipment  False
10   2/27/2015  JH401 TFR?FR xxx8690   400.00       49246  Cansel Survey Equipment  False
11   2/27/2015   CANSEL SURVEY E PAY   555.62       49246  Cansel Survey Equipment   True
12   2/25/2015  HU204 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment  False
13   2/23/2015  UQ263 TFR?FR xxx8690   102.00       49246  Cansel Survey Equipment  False
14   2/23/2015  UT460 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment  False
15   2/20/2015   CANSEL SURVEY E PAY  1222.05       49246  Cansel Survey Equipment   True
17   2/17/2015  UO414 TFR?FR xxx8690   250.00       49246  Cansel Survey Equipment  False
19   2/11/2015  HI540 TFR?FR xxx8690   130.00       49246  Cansel Survey Equipment  False
20   2/11/2015  HQ010 TFR?FR xxx8690   177.00       49246  Cansel Survey Equipment  False
21   2/10/2015  WU455 TFR?FR xxx8690   200.00       49246  Cansel Survey Equipment  False
22    2/6/2015  JJ500 TFR?FR xxx8690   301.00       49246  Cansel Survey Equipment  False
23    2/6/2015   CANSEL SURVEY E PAY  1182.08       49246  Cansel Survey Equipment   True
24    2/5/2015  IR453 TFR?FR xxx8690   168.56       49246             Cansel IR453   True
26    2/2/2015  RQ574 TFR?FR xxx8690   500.00       49246  Cansel Survey Equipment  False
27    2/2/2015  UT022 TFR?FR xxx8690   850.00       49246  Cansel Survey Equipment  False
28  12/31/2014  HU521 TFR?FR xxx8690   950.17       49246      Cansel Survey HU521   True

Some explanations:

In [266]: cols_desc
Out[266]:
['cansel',
 'fr',
 'hi540',
 'hq010',
 'hu204',
 'hu521',
 'ir453',
 'jh401',
 'jj500',
 'jx154',
 'pay',
 'rq574',
 'survey',
 'tfr',
 'ue200',
 'uo414',
 'uq263',
 'ut022',
 'ut460',
 'wu455',
 'ww120',
 'xxx8690']

In [267]: cols_emp
Out[267]: ['cansel', 'equipment', 'hu521', 'ir453', 'survey']

In [268]: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]

In [269]: common_cols_idx
Out[269]: [0, 5, 6, 12]

In [270]: X.toarray()
Out[270]:
array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1],
       [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]], dtype=int64)

In [271]: X.toarray()[:, common_cols_idx]
Out[271]:
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 0, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 0, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 0, 0, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [1, 0, 0, 1],
       [0, 0, 1, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 1, 0, 0]], dtype=int64)

In [272]: X.toarray()[:, common_cols_idx] == 1
Out[272]:
array([[False, False, False, False],
       [False, False, False, False],
       [ True, False, False,  True],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False,  True],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False,  True],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [False, False, False, False],
       [ True, False, False,  True],
       [False, False,  True, False],
       [False, False, False, False],
       [False, False, False, False],
       [False,  True, False, False]], dtype=bool)

In [273]: (X.toarray()[:, common_cols_idx] == 1).any(1)
Out[273]: array([False, False,  True, False, False,  True, False, False, False,  True, False, False, False, False, False,  True,  True, Fals
e, False,  True], dtype=bool)

Here is a readable solution using an individual search_func:

def search_func(row):
    matches = [test_value in row["Description"].lower() 
               for test_value in row["Text_Search"]]

    if any(matches):
        return "Yes"
    else:
        return "No"

This function is then applied row-wise:

# create example data
df = pd.DataFrame({"Description": ["CANSEL SURVEY E PAY", "JX154 TFR?FR xxx8690"],
                   "Employer": ["Cansel Survey Equipment", "Cansel Survey Equipment"]})

print(df)
    Description             Employer
0   CANSEL SURVEY E PAY     Cansel Survey Equipment
1   JX154 TFR?FR xxx8690    Cansel Survey Equipment

# create text searches and match column
df["Text_Search"] = df["Employer"].str.lower().str.split()
df["Match"] = df.apply(search_func, axis=1)

# show result
print(df)
    Description             Employer                    Text_Search                     Match
0   CANSEL SURVEY E PAY     Cansel Survey Equipment     [cansel, survey, equipment]     Yes
1   JX154 TFR?FR xxx8690    Cansel Survey Equipment     [cansel, survey, equipment]     No

Timing comparison for different solutions

Let's prepare a bit bigger DF - 2.000 rows:

In [3]: df = pd.concat([df] * 10**2, ignore_index=True)

In [4]: df.shape
Out[4]: (2000, 5)

Solution 1: df.apply(..., axis=1):

df["Text_Search"] = df.Employer.str.lower().str.split().map(set)

In [15]: %%timeit
    ...: df.Description.str.lower().str.split().map(set).to_frame('desc') \
    ...:               .apply(lambda r: (df["Text_Search"] & r.desc).any(),
    ...:                      axis=1)
    ...:
1 loop, best of 3: 5.06 s per loop

Solution 2: CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=1, lowercase=True)

In [8]: %%timeit
   ...: X = vect.fit_transform(df['Employer'])
   ...: cols_emp = vect.get_feature_names()
   ...: X = vect.fit_transform(df['Description'])
   ...: cols_desc = vect.get_feature_names()
   ...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
   ...: (X.toarray()[:, common_cols_idx] == 1).any(1)
   ...:
10 loops, best of 3: 88.2 ms per loop

Solution 3: df.apply(search_func, axis=1)

df["Text_Search"] = df["Employer"].str.lower().str.split()

In [12]: %%timeit
    ...: df.apply(search_func, axis=1)
    ...:
1 loop, best of 3: 362 ms per loop

NOTE: Solution 1 is too slow, so i will not "timeit" this solution for bigger DFs

Comparison of df.apply(search_func, axis=1) and CountVectorizer for 20.000 rows DF:

In [16]: df = pd.concat([df] * 10, ignore_index=True)

In [17]: df.shape
Out[17]: (20000, 6)

In [20]: %%timeit
    ...: df.apply(search_func, axis=1)
    ...:
1 loop, best of 3: 3.66 s per loop

In [21]: %%timeit
    ...: X = vect.fit_transform(df['Employer'])
    ...: cols_emp = vect.get_feature_names()
    ...: X = vect.fit_transform(df['Description'])
    ...: cols_desc = vect.get_feature_names()
    ...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
    ...: (X.toarray()[:, common_cols_idx] == 1).any(1)
    ...:
1 loop, best of 3: 825 ms per loop

Comparison of df.apply(search_func, axis=1) and CountVectorizer for 200.000 rows DF:

In [22]: df = pd.concat([df] * 10, ignore_index=True)

In [23]: df.shape
Out[23]: (200000, 6)

In [24]: %%timeit
    ...: df.apply(search_func, axis=1)
    ...:
1 loop, best of 3: 36.8 s per loop

In [25]: %%timeit
    ...: X = vect.fit_transform(df['Employer'])
    ...: cols_emp = vect.get_feature_names()
    ...: X = vect.fit_transform(df['Description'])
    ...: cols_desc = vect.get_feature_names()
    ...: common_cols_idx = [i for i,col in enumerate(cols_desc) if col in cols_emp]
    ...: (X.toarray()[:, common_cols_idx] == 1).any(1)
    ...:
1 loop, best of 3: 8.28 s per loop

Conclusion: CountVectorized solution is apporx. 4.44 times faster compared to df.apply(search_func, axis=1)

Python/Pandas: How to Match List of Strings with a DataFrame column

Timing comparison for different solutions

Tags:

Python

Pandas

String

Regex

Matching

Related

Recent Posts