Find occurrences of huge list of phrases in text
Maybe you should try flashtext.
According to the author, it is much faster than regular expressions.
The author even published a paper about the library.
I've personally tried it in one of my projects, and in my opinion its API is quite friendly and usable.
Hope it helps.
I faced an almost identical problem with my own chat page system. I wanted to be able to add a link to a number of keywords (with slight variations) that were present in the text. I only had around 200 phrases to check, though.
I decided to try using a standard regular expression for the problem to see how fast it would be. The main bottleneck was in constructing the regular expression. I decided to pre-compile this and found the match time was very fast for shorter texts.
The following approach takes a list of phrases, where each entry contains 'phrase' and 'link' keys. It first constructs a reverse lookup dictionary:
{'phrase to match' : 'link_url', 'another phrase' : 'link_url2'}
Next, it compiles a regular expression of the following form, which allows matches containing varying amounts of whitespace between words:
(phrase\s+to\s+match|another\s+phrase)
Then, for each piece of text (e.g. 2000 words each), it uses finditer() to get each match. The match object gives you .span(), the start and end location of the matching text, and group(1), the matched text. As the text can contain extra whitespace, re_whitespace is first applied to collapse it and bring it back to the form stored in the reverse dictionary. With this, it is possible to automatically look up the required link:
import re
# The sample texts contain runs of extra whitespace on purpose
texts = ['this is a phrase   to     match', 'another phrase this  is']
phrases = [{'phrase': 'phrase to match', 'link': 'link_url'}, {'phrase': 'this is', 'link': 'link_url2'}]
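# Reverse lookup: phrase (single-spaced) -> link
reverse = {d['phrase']: d['link'] for d in sorted(phrases, key=lambda x: x['phrase'])}
# Collapses any run of whitespace in a match back to a single space
re_whitespace = re.compile(r'\s+')
# One alternation of all phrases, allowing any whitespace between words
re_phrases = re.compile('({})'.format('|'.join(d['phrase'].replace(' ', r'\s+') for d in phrases)))

for text in texts:
    matches = [(match.span(), reverse[re_whitespace.sub(' ', match.group(1))]) for match in re_phrases.finditer(text)]
    print(matches)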
This displays the matches for the two texts as:
[((0, 7), 'link_url2'), ((10, 30), 'link_url')]
[((15, 23), 'link_url2')]
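The snippet above inserts the phrases into the pattern verbatim, so a phrase containing regex metacharacters (such as "C++") would break it, and a short phrase listed first can shadow a longer one that starts the same way. A possible variation, not part of my original code, escapes each word and puts the longest phrases first. The names compile_phrases and find_links are made up for this sketch:

```python
import re

def compile_phrases(phrases):
    """Return (compiled regex, reverse lookup) for a list of phrase/link dicts."""
    # Normalise whitespace so the lookup key matches the cleaned-up match text
    reverse = {' '.join(d['phrase'].split()): d['link'] for d in phrases}
    # Longest phrases first, so a short phrase cannot shadow a longer one
    ordered = sorted(reverse, key=len, reverse=True)
    # Escape every word, then allow any run of whitespace between words
    alternatives = (r'\s+'.join(re.escape(w) for w in p.split()) for p in ordered)
    # Lookarounds instead of \b, so phrases ending in punctuation still match
    pattern = r'(?<!\w)(?:{})(?!\w)'.format('|'.join(alternatives))
    return re.compile(pattern), reverse

def find_links(text, regex, reverse):
    """Return [((start, end), link), ...] for every phrase found in text."""
    return [(m.span(), reverse[' '.join(m.group(0).split())])
            for m in regex.finditer(text)]

sample_phrases = [
    {'phrase': 'C++', 'link': 'cpp'},
    {'phrase': 'phrase to', 'link': 'short'},
    {'phrase': 'phrase to match', 'link': 'long'},
]
regex, reverse = compile_phrases(sample_phrases)
print(find_links('a phrase  to match in C++ code', regex, reverse))
```

Sorting longest-first matters because Python's alternation takes the first branch that succeeds, not the longest one.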
To see how this scales, I tested it by importing a list of English words from nltk and automatically creating 80,000 two-to-six-word phrases along with unique links. I then timed it on two suitably long texts:
import re
import random
from nltk.corpus import words
import time
english = words.words()
def random_phrase(l=2, h=6):
    # Join between l and h randomly chosen English words into one phrase
    return ' '.join(random.sample(english, random.randint(l, h)))
# The sample texts contain runs of extra whitespace on purpose
texts = ['this is a phrase   to     match', 'another phrase this  is']
# Make texts ~2000 characters
texts = ['{} {}'.format(t, random_phrase(200, 200)) for t in texts]
phrases = [{'phrase': 'phrase to match', 'link': 'link_url'}, {'phrase': 'this is', 'link': 'link_url2'}]
# Simulate 80k phrases
for x in range(80000):
    phrases.append({'phrase': random_phrase(), 'link': 'link{}'.format(x)})
construct_time = time.time()
reverse = {d['phrase']:d['link'] for d in phrases}
re_whitespace = re.compile(r'\s+')
re_phrases = re.compile('({})'.format('|'.join(d['phrase'].replace(' ', r'\s+') for d in sorted(phrases, key=lambda x: len(x['phrase'])))))
print('Time to construct:', time.time() - construct_time)
print()
for text in texts:
    start_time = time.time()
    print('{} characters - "{}..."'.format(len(text), text[:60]))
    matches = [(match.span(), reverse[re_whitespace.sub(' ', match.group(1))]) for match in re_phrases.finditer(text)]
    print(matches)
    print('Time taken:', time.time() - start_time)
    print()
This takes ~17 seconds to construct the regular expression and reverse lookup (which is only needed once). It then takes about 6 seconds per text. For very short texts it takes ~0.06 seconds per text.
Time to construct: 16.812477111816406
2092 characters - "this is a phrase to match totaquine externize intoxatio..."
[((0, 7), 'link_url2'), ((10, 30), 'link_url')]
Time taken: 6.000027656555176
2189 characters - "another phrase this is political procoracoidal playstead as..."
[((15, 23), 'link_url2')]
Time taken: 6.190425715255737
This will at least give you a baseline to compare against.
To get reasonable speed while matching 80k patterns, you definitely need some preprocessing of the patterns; single-shot algorithms like Boyer-Moore won't help much.
You'll probably also need to do the work in compiled code (think C extension) to get reasonable throughput. As for how to preprocess the patterns, one option is a state machine such as Aho-Corasick or some generic finite state transducer. The next option is something like a suffix array based index, and the last one that comes to mind is an inverted index.
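To make the state machine option concrete, here is a minimal pure-Python sketch of a character-level Aho-Corasick automaton. It is for illustration only: the function names build_automaton and find_all are made up, it matches raw substrings (no word-boundary checks, no tolerance for extra whitespace), and a compiled implementation would be far faster.

```python
from collections import deque

def build_automaton(patterns):
    """Build goto, failure and output tables for a set of non-empty patterns."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state for the longest proper suffix
    out = [[]]    # out[state] -> patterns that end at this state
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: children of the root keep fail = 0
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the suffix state
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Return (start, end, pattern) for every occurrence, in one pass over text."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, i + 1, pattern))
    return hits

automaton = build_automaton(['phrase to match', 'this is', 'is a'])
print(find_all('this is a phrase to match', automaton))
```

The text is scanned once regardless of how many patterns there are, which is the property that makes this family of algorithms attractive for 80k phrases.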
If your matches are exact and the patterns respect word boundaries, chances are that a well-implemented word or word-ngram keyed inverted index will be fast enough even in pure Python. The index is not a complete solution; rather, it gives you a few candidate phrases which you then need to check with normal string matching for a complete match.
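A rough sketch of the word-keyed variant, under the assumption that phrases and text can be split on whitespace: the index maps each word to the phrases containing it, the text's words select candidates, and only those candidates are verified with a plain substring check. The names build_index and find_phrases are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(phrases):
    """Return (normalised phrases, word -> set of phrase ids)."""
    normalised = [' '.join(p.split()) for p in phrases]
    index = defaultdict(set)
    for pid, phrase in enumerate(normalised):
        for word in phrase.split():
            index[word].add(pid)
    return normalised, index

def find_phrases(text, normalised, index):
    """Return the phrases that occur in text as whole-word sequences."""
    clean = ' '.join(text.split())
    # Count, per phrase, how many of its distinct words occur in the text
    seen = Counter()
    for word in set(clean.split()):
        for pid in index.get(word, ()):
            seen[pid] += 1
    padded = ' ' + clean + ' '
    hits = []
    for pid, count in seen.items():
        phrase = normalised[pid]
        # Candidate only if every word is present; then verify the exact order
        if count == len(set(phrase.split())) and ' ' + phrase + ' ' in padded:
            hits.append(phrase)
    return sorted(hits)

normalised, index = build_index(['phrase to match', 'this is', 'another phrase'])
print(find_phrases('this is a phrase   to     match', normalised, index))
```

With 80k random phrases, most phrases share no word with a given text, so the verification step only runs on a small fraction of them.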
If you need approximate matching, a character-ngram inverted index is your choice.
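As a hedged illustration of the character-ngram idea, the sketch below indexes phrases by their trigrams and ranks candidates for a query by Jaccard similarity of trigram sets. The trigram size, the Jaccard score, the 0.5 threshold and the function names are all assumptions chosen for the example, not a recommendation.

```python
from collections import Counter, defaultdict

def trigrams(s):
    """Set of character trigrams of s, lower-cased and padded with spaces."""
    s = ' ' + ' '.join(s.lower().split()) + ' '
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_trigram_index(phrases):
    """Map each trigram to the ids of the phrases containing it."""
    index = defaultdict(set)
    for pid, phrase in enumerate(phrases):
        for gram in trigrams(phrase):
            index[gram].add(pid)
    return index

def similar_phrases(query, phrases, index, threshold=0.5):
    """Return (score, phrase) pairs with Jaccard score >= threshold, best first."""
    grams = trigrams(query)
    shared = Counter()
    for gram in grams:
        for pid in index.get(gram, ()):
            shared[pid] += 1
    scored = []
    for pid, count in shared.items():
        union = len(grams | trigrams(phrases[pid]))
        score = count / union
        if score >= threshold:
            scored.append((score, phrases[pid]))
    return sorted(scored, reverse=True)

known = ['phrase to match', 'another phrase']
trigram_index = build_trigram_index(known)
# The query contains a typo ("mach") but still shares most trigrams
print(similar_phrases('phrase to mach', known, trigram_index))
```

To search a long text rather than a short query, you would slide a window over it and score each window, which this sketch leaves out.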
As for real implementations, flashtext, mentioned in another answer here, seems to be a reasonable pure Python solution if you're OK with the full-phrase-only limitation.
Otherwise you can get reasonable results with generic multi-pattern capable regexp libraries: one of the fastest should be Intel's hyperscan, and there are even some rudimentary Python bindings available.
Another option is Google's RE2 with Python bindings from Facebook. You want to use RE2::Set in this case.