How to remove strings present in a list from a column in pandas
I think you need str.replace
if you also want to remove substrings (matches inside longer words):
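# regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
df['name'] = df['name'].str.replace('|'.join(To_remove_lst), '', regex=True)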
If the strings may contain regex special characters, escape them first:
import re
df['name'] = df['name'].str.replace('|'.join(map(re.escape, To_remove_lst)), '', regex=True)
print(df)
   ID           name
0   1          Kitty
1   2          Puppy
2   3     is example
3   4  stackoverflow
4   5          World
But if you want to remove only whole words, use a nested list comprehension:
df['name'] = [' '.join([y for y in x.split() if y not in To_remove_lst]) for x in df['name']]
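The difference between the two approaches is easiest to see side by side. The following is only a minimal sketch on made-up data (the frame and the `to_remove` list are hypothetical, not the asker's), assuming a pandas version whose `str.replace` accepts `regex=True`:

```python
import re
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"name": ["Hello Kitty", "Hellokitty World"]})
to_remove = ["Hello"]

pattern = "|".join(map(re.escape, to_remove))

# Substring removal: "Hello" is stripped even inside "Hellokitty".
substrings_removed = df["name"].str.replace(pattern, "", regex=True)

# Whole-word removal: only whitespace-separated tokens that equal
# an entry of to_remove are dropped.
words_removed = pd.Series(
    [" ".join(w for w in text.split() if w not in to_remove) for text in df["name"]],
    index=df.index,
)

print(substrings_removed.tolist())  # [' Kitty', 'kitty World']
print(words_removed.tolist())       # ['Kitty', 'Hellokitty World']
```

Note that substring removal can leave stray whitespace behind, while the split/join version normalizes it as a side effect.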
I'd recommend re.sub in a list comprehension for speed.
import re
p = re.compile('|'.join(map(re.escape, To_remove_lst)))
df['name'] = [p.sub('', text) for text in df['name']]
print(df)
   ID           name
0   1          Kitty
1   2          Puppy
2   3     is example
3   4  stackoverflow
4   5          World
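The list comprehension does little more than call the compiled pattern's sub method (implemented in C) once per row, so it avoids the extra per-element overhead of the pandas str methods. For the time being, I highly recommend list comprehensions over the pandas str functions when working with string and regex data, because the str API is a bit slow.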
map(re.escape, To_remove_lst) escapes any regex metacharacters in the strings so that they are treated literally during replacement.
The pattern is compiled once, before the loop that calls p.sub, to avoid the overhead of compilation at each iteration.
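To illustrate why the escaping matters, here is a small sketch using only the standard library. The removal list and the sample strings are invented for the example; they contain "." and parentheses, which have special meaning in a regex:

```python
import re

# Hypothetical removal list containing regex metacharacters.
to_remove = ["1.5", "(beta)"]
texts = ["v1.5 (beta)", "v125 stable"]

# Unescaped: "." matches any character and "(beta)" becomes a group.
raw = re.compile("|".join(to_remove))
# Escaped: every entry is matched literally.
escaped = re.compile("|".join(map(re.escape, to_remove)))

raw_result = [raw.sub("", t) for t in texts]
escaped_result = [escaped.sub("", t) for t in texts]

print(raw_result)      # ['v ()', 'v stable']
print(escaped_result)  # ['v ', 'v125 stable']
```

Without escaping, "125" is wrongly removed and the literal parentheses are left behind. Both patterns are compiled once, outside the comprehensions.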
I've let it slide here, but please use PEP 8-compliant variable names: to_remove_lst (lowercase snake case) rather than To_remove_lst.
Timings
df = pd.concat([df] * 10000)
%timeit df['name'].str.replace('|'.join(To_remove_lst), '', regex=True)
%timeit [p.sub('', text) for text in df['name']]
100 ms ± 5.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
60 ms ± 3.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)