Faster way to remove stop words in Python
Try caching the stopwords object, as shown below. Constructing this each time you call the function seems to be the bottleneck.
from nltk.corpus import stopwords

cachedStopWords = stopwords.words("english")

def testFuncOld():
    text = 'hello bye the the hi'
    # Rebuilds the stop word list on every call -- this is the bottleneck.
    text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

def testFuncNew():
    text = 'hello bye the the hi'
    # Reuses the list cached once at module level.
    text = ' '.join([word for word in text.split() if word not in cachedStopWords])

if __name__ == "__main__":
    for i in range(10000):
        testFuncOld()
        testFuncNew()
I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.
ncalls  cumtime  filename:lineno(function)
 10000    7.723  words.py:7(testFuncOld)
 10000    0.140  words.py:11(testFuncNew)
So, caching the stopwords instance gives a roughly 55x speedup (7.723 s vs. 0.140 s cumulative).
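As a further sketch, not part of the benchmark above: membership tests against a list are O(n), so converting the cached list to a set should compound the win on longer inputs. The name testFuncSet is hypothetical, for illustration only.

from nltk.corpus import stopwords

# Cache once, as a set, so each `word not in ...` check is an
# O(1) hash lookup instead of a linear scan of the list.
cachedStopWordsSet = set(stopwords.words("english"))

def testFuncSet():
    text = 'hello bye the the hi'
    text = ' '.join([word for word in text.split() if word not in cachedStopWordsSet])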
Use a regexp that matches the stop words and removes them:
import re
from nltk.corpus import stopwords

# One alternation branch per stop word; \s* also consumes the following space.
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)
This will probably be much faster than looping yourself, especially for large input strings, since the matching happens inside the regex engine.
If the last word of the text is a stop word, removing it leaves trailing whitespace; I suggest handling that separately, as in the sketch below.
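A minimal hardened version, as a sketch: re.escape is defensive hygiene in case a stop word ever contains a regex metacharacter, and .strip() handles the trailing whitespace. The helper name remove_stopwords is just for illustration.

import re
from nltk.corpus import stopwords

# Escape each stop word before joining them into one alternation,
# then strip whatever trailing whitespace the substitution leaves behind.
pattern = re.compile(
    r'\b(' + '|'.join(re.escape(w) for w in stopwords.words('english')) + r')\b\s*'
)

def remove_stopwords(text):
    # Hypothetical helper, not part of the original answer.
    return pattern.sub('', text).strip()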
Sorry for the late reply, but this may prove useful for new users.
- Create a dictionary of stop words using the collections library.
- Use that dictionary for membership tests: a dict lookup is O(1) on average, whereas scanning the stop word list is O(n) in the number of stop words.
from collections import Counter
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)  # `in` checks the dict keys in O(1)

text = ' '.join([word for word in text.split() if word not in stopwords_dict])
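Note that only the keys of the Counter are consulted here; a plain set(stop_words) gives the same average O(1) membership test and is the more conventional choice when the counts are never used.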