Why does increasing precision make this program faster?
It is the combination of + and \1 in the regex
Methods
I used the following test code:
import time
import re
import string
t=time.time()
re.compile() # I tried differend regexes here
print(time.time()-t)
def test(n):
t=time.time()
match = rex.search(string.ascii_lowercase*n)
print(match, time.time()-t)
After restarting the python session, the first call to re.compile
takes longer than subsequent compilations of the same regex.
compile(sec) search (sec)
REGEX 1st 2nd short long string
r"(abcdefghijklmnopqrstuvwxyz){6,}" 10^-4 10^-5 10^-5 10^-5
r"(abcdefghijklmnopqrstuvwxyz)\1\1\1\1" 10^-4 10^-5 10^-6 10^-6
r"([a-z]+?)\1\1\1\1" 10^-4 10^-5 10^-4 10^-5
r"([a-z]+)\1\1\1\1" 10^-4 10^-5 10^-4 10^-5
r"([a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z])\1\1\1\1"
10^-4 10^-5 10^-6 10^-6
Interestingly, occasionally r"([a-z]+?)\1\1\1"
would be quick (10^-5 sec) for too short strings as well.
Discussion
There is some caching involved in compiling the rexex, but this was not the reason here.
It seems that the combination of the +
operator (both greedy and non-greedy) inside the group and the \1
in the regex is at fault. For some reason, this combination is quicker if it actually matches than if it does not match.
To find out more, we probably have to understand the C source code of the sre
module