How to make pandas dataframe str.contains search faster

BrenBarn's answer above helped me solve my issue. I'm writing down the problem and how it was solved below; hope it helps someone :)

My data was around 2,000 rows, mostly text. Previously, I used a regular expression with the ignore-case flag, shown below:

import re

# One lookahead per term: a row matches only if every term appears somewhere in it.
reg_exp = ''.join(['(?=.*%s)' % (i) for i in search_list])
series_to_search = data_new.iloc[:,title_column_index] + ' : ' + data_new.iloc[:,description_column_index]
data_new = data_new[series_to_search.str.contains(reg_exp, flags=re.IGNORECASE)]

This code, for a search list containing ['exception', 'VE20'], took 58.710898 seconds.

When I replaced this code with a simple for loop, it took only 0.055304 seconds, an improvement of roughly 1,060×!

for search in search_list:
    # Rebuild the combined series each pass, since data_new shrinks after each filter.
    series_to_search = data_new.iloc[:,title_column_index] + ' : ' + data_new.iloc[:,description_column_index]
    # Lowercase both sides for a case-insensitive match without re.IGNORECASE.
    data_new = data_new[series_to_search.str.lower().str.contains(search.lower())]
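
If rebuilding the series on every pass bothers you, a variation (untimed, and assuming the same data_new, search_list, and column-index names as above) is to build and lowercase the combined text once, then AND one plain-substring mask per term:

import numpy as np

# Build and lowercase the combined column a single time.
series_to_search = (data_new.iloc[:,title_column_index] + ' : '
                    + data_new.iloc[:,description_column_index]).str.lower()

# One substring mask per search term, ANDed together, then a single filter.
mask = np.logical_and.reduce(
    [series_to_search.str.contains(s.lower(), regex=False) for s in search_list]
)
data_new = data_new[mask]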

You could try converting the column to a list. It seems that searching a plain list is significantly faster than applying string methods to a Series.

Sample code:

import timeit
import pandas as pd

df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
                       "The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
                       "While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})

def first_way():
    # Vectorized pandas string search; str.contains already returns a Series.
    df["new"] = df["col"].str.contains('Donald', case=True, na=False)
    return None

print("First_way: ")
%timeit for x in range(10): first_way()
print(df)

# Rebuild the DataFrame so the second test starts from the same state.
df = pd.DataFrame({'col': ["very definition of the American success story, continually setting the standards of excellence in business, real estate and entertainment.",
                       "The myriad vulgarities of Donald Trump—examples of which are retailed daily on Web sites and front pages these days—are not news to those of us who have",
                       "While a fearful nation watched the terrorists attack again, striking the cafés of Paris and the conference rooms of San Bernardino"]})

def second_way():
    # Drop to a plain Python list and use the substring operator directly.
    listed = df["col"].tolist()
    df["new"] = ["Donald" in n for n in listed]
    return None

print("Second way: ")
%timeit for x in range(10): second_way()
print(df)

Results:

First_way: 
100 loops, best of 3: 2.77 ms per loop
                                                 col    new
0  very definition of the American success story,...  False
1  The myriad vulgarities of Donald Trump—example...   True
2  While a fearful nation watched the terrorists ...  False
Second way: 
1000 loops, best of 3: 1.79 ms per loop
                                                 col    new
0  very definition of the American success story,...  False
1  The myriad vulgarities of Donald Trump—example...   True
2  While a fearful nation watched the terrorists ...  False

If the number of substrings is small, it may be faster to search for them one at a time, because then you can pass regex=False to str.contains, which speeds it up.

On a sample DataFrame of about 6000 rows that I tested with two sample substrings, blah.str.contains("foo", regex=False) | blah.str.contains("bar", regex=False) was about twice as fast as blah.str.contains("foo|bar"). You'd have to test it with your data to see how it scales.
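
A minimal sketch of that comparison, with an illustrative column name and substrings (your timings will vary with the data):

import pandas as pd

# Roughly 6000 rows of sample text, as in the test above.
df = pd.DataFrame({'col': ['foo and more', 'nothing here', 'bar indeed'] * 2000})
blah = df['col']

# One regex alternation: flexible, but pays the regex-engine cost on every row.
regex_mask = blah.str.contains('foo|bar')

# Plain substring scans ORed together: one cheap pass per term, no regex engine.
plain_mask = (blah.str.contains('foo', regex=False)
              | blah.str.contains('bar', regex=False))

# Both approaches select the same rows.
assert regex_mask.equals(plain_mask)
df_matched = df[plain_mask]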