Why does increasing precision make this program faster?

It is the combination of + and \1 in the regex

Methods

I used the following test code:

import time
import re
import string
t=time.time()
re.compile() # I tried differend regexes here
print(time.time()-t)
def test(n):
    t=time.time()
    match = rex.search(string.ascii_lowercase*n)
    print(match, time.time()-t)

After restarting the python session, the first call to re.compile takes longer than subsequent compilations of the same regex.

                                        compile(sec)   search (sec)    
REGEX                                   1st     2nd    short   long string
r"(abcdefghijklmnopqrstuvwxyz){6,}"     10^-4   10^-5  10^-5   10^-5
r"(abcdefghijklmnopqrstuvwxyz)\1\1\1\1" 10^-4   10^-5  10^-6   10^-6
r"([a-z]+?)\1\1\1\1"                    10^-4   10^-5  10^-4   10^-5 
r"([a-z]+)\1\1\1\1"                     10^-4   10^-5  10^-4   10^-5
r"([a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z])\1\1\1\1"
                                        10^-4  10^-5  10^-6  10^-6

Interestingly, occasionally r"([a-z]+?)\1\1\1" would be quick (10^-5 sec) for too short strings as well.

Discussion

There is some caching involved in compiling the rexex, but this was not the reason here.

It seems that the combination of the + operator (both greedy and non-greedy) inside the group and the \1 in the regex is at fault. For some reason, this combination is quicker if it actually matches than if it does not match.

To find out more, we probably have to understand the C source code of the sre module

Why does increasing precision make this program faster?

It is the combination of + and \1 in the regex

Methods

Discussion

Tags:

Python

Performance

Python 3.X

Related

Recent Posts