When to use re.compile
This is a tricky subject: many answers, even some legitimate sources such as David Beazley's Python Cookbook, will tell you something like:
[Use compile()] when you're going to perform a lot of matches using the same pattern. This lets you compile the regex only once versus at each match. [see p. 45 of that book]
However, that really hasn't been true since sometime around Python 2.5. Here's a note straight out of the re docs:
Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.
There are two small arguments against this, but (anecdotally speaking) they won't result in noticeable timing differences the majority of the time:
- The size of the cache is limited.
- Using compiled expressions directly avoids the cache lookup overhead (see the sketch just below).
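To see the caching in action, here is a minimal sketch. (The identity check relies on CPython's cache implementation and is not a documented guarantee; re.purge(), which clears the cache, is documented.)

import re

p1 = re.compile(r'\d+')
p2 = re.compile(r'\d+')
print(p1 is p2)   # True in CPython: the second call hits the pattern cache

re.purge()        # clear the regular expression cache
p3 = re.compile(r'\d+')
print(p1 is p3)   # False: the cache was cleared, so the pattern was recompiled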
Here's a rudimentary test of the above using the 20 newsgroups text dataset. On a relative basis, the improvement in speed is only about 1.6% with compiling, presumably due mostly to the avoided cache lookup.
import re
from sklearn.datasets import fetch_20newsgroups

# A list of ~20,000 paragraphs of text
news = fetch_20newsgroups(subset='all', random_state=444).data

# The tokenizer used by most text-processing vectorizers such as TF-IDF
regex = r'(?u)\b\w\w+\b'
regex_comp = re.compile(regex)

def no_compile():
    for text in news:
        re.findall(regex, text)

def with_compile():
    for text in news:
        regex_comp.findall(text)
%timeit -r 3 -n 5 no_compile()
1.78 s ± 16.2 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)
%timeit -r 3 -n 5 with_compile()
1.75 s ± 12.2 ms per loop (mean ± std. dev. of 3 runs, 5 loops each)
That really only leaves one very defensible reason to use re.compile():

By precompiling all expressions when the module is loaded, the compilation work is shifted to application start time, instead of to a point when the program may be responding to a user action. [source; p. 15]

It's not uncommon to see constants declared at the top of a module with compile. For example, in smtplib you'll find OLDSTYLE_AUTH = re.compile(r"auth=(.*)", re.I).
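A minimal sketch of the same idiom (the pattern name and function here are illustrative, not taken from smtplib):

import re

# Compiled once, when the module is imported
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}")

def extract_date(line):
    """Return the first ISO-style date in `line`, or None."""
    m = TIMESTAMP.search(line)
    return m.group(0) if m else None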
Note that compiling happens (eventually) whether or not you use re.compile(). When you do use compile(), you're compiling the passed regex at that moment. If you use the module-level functions like re.search(), you're compiling and searching in one call. The two processes below are equivalent in this regard:
# With re.compile: get a regular expression object (an re.Pattern instance)
# and then call its method, `.search()`.
a = re.compile('regex(es|p)')  # compiling happens now
a.search('regexp')             # searching happens now

# With the module-level function
re.search('regex(es|p)', 'regexp')  # compiling and searching both happen here
Lastly, you asked:

Is there a better way to match regular words without regex?

Yes; this is mentioned as a "common problem" in the HOWTO:
Sometimes using the re module is a mistake. If you’re matching a fixed string, or a single character class, and you’re not using any re features such as the IGNORECASE flag, then the full power of regular expressions may not be required. Strings have several methods for performing operations with fixed strings and they’re usually much faster, because the implementation is a single small C loop that’s been optimized for the purpose, instead of the large, more generalized regular expression engine. [emphasis added]
...
In short, before turning to the re module, consider whether your problem can be solved with a faster and simpler string method.
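For instance, if all you need is a fixed-substring test, a plain string operation is clearer and faster. (A small illustrative comparison, not taken from the HOWTO.)

import re

line = "connection error: timed out"

# Fixed string: string methods are enough
if "error" in line:
    print(line.split("error", 1)[1])  # ': timed out'

# The regex equivalent routes the same work through the full engine
if re.search("error", line):
    print(re.split("error", line, maxsplit=1)[1])  # ': timed out'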
Let's say that word1, word2, ... are regexes; let's rewrite those parts:
allWords = [re.compile(m) for m in ["word1", "word2", "word3"]]
I would create one single regex for all patterns:
allWords = re.compile("|".join(["word1", "word2", "word3"]))
To support regexes with | in them, you would have to wrap each expression in a (preferably non-capturing) group:
allWords = re.compile("|".join("(?:{})".format(x) for x in ["word1", "word2", "word3"]))
(That also works with ordinary words, of course, and it's still worth using regexes because of the | part; see the quick check below.)
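A quick illustrative check of the joined pattern (the words here are made up). Note the non-capturing (?:...) groups: plain capturing groups would make .split() insert the captured texts into its result, which would break the rewrite below.

import re

words = ["cat|dog", "bird"]
allWords = re.compile("|".join("(?:{})".format(x) for x in words))
print(allWords.pattern)                         # (?:cat|dog)|(?:bird)
print(allWords.search("a dog ran").group())     # 'dog'
print(allWords.split("a dog ran", maxsplit=1))  # ['a ', ' ran']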
Now, this is a disguised loop with each term hardcoded:
def bar(data, allWords):
    if allWords[0].search(data):
        temp = data.split("word1", 1)[1]  # that works only on non-regexes, BTW
        return temp
    elif allWords[1].search(data):
        temp = data.split("word2", 1)[1]
        return temp
can be rewritten simply as:

def bar(data, allWords):
    return allWords.split(data, maxsplit=1)[1]
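A quick usage check (with the patterns joined as above):

allWords = re.compile("|".join(["word1", "word2", "word3"]))
print(bar("before word2 after", allWords))  # ' after'

One caveat: when nothing matches, split() returns the whole string as a one-element list, so the [1] lookup raises IndexError, whereas the original elif chain fell through and returned None.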
In terms of performance:
- The regular expression is compiled at start, so it's as fast as it can be.
- There's no loop or pasted expressions; the "or" part is done by the regex engine, which is most of the time compiled code: you can't beat that in pure Python.
- The match and the split are done in one operation.
The last hiccup is that internally the regex engine searches for the alternatives in a loop, which makes it an O(n) algorithm. To make it faster, you would have to predict which pattern is the most frequent and put it first (my hypothesis is that the regexes are "disjoint", meaning no text can be matched by several of them; otherwise the longest would have to come before the shorter one).
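To see the ordering effect concretely (a minimal demonstration; Python's engine is leftmost-first, not POSIX longest-match):

import re

# Alternatives are tried left to right; the first branch that matches wins.
print(re.match("a|ab", "ab").group())  # 'a'  -- the shorter branch comes first
print(re.match("ab|a", "ab").group())  # 'ab' -- longest-first yields the longer match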