Python string occurrence count regex performance
OK, I struggled to make this work without regexes, since, as we all know, regexes are slow. Here is what I came up with:
    def count_occurrences(word, text):
        # Delimiters that may appear immediately before an occurrence...
        spaces = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__"]
        # ...and those that may appear immediately after it.
        endings = spaces + ['?', '.', '!', ',', ')', '"', '»']
        s = text.lower().split(word.lower())
        l = len(s)
        return sum((
            # First chunk: the text may begin with the word itself.
            (i == 0 and (s[0] == '' or any(s[i].endswith(t) for t in spaces)) and (s[1] == '' or any(s[i+1].startswith(t) for t in endings)))
            # Last chunk: the text may end with the word itself.
            or (i == l - 2 and any(s[i].endswith(t) for t in spaces) and (s[i+1] == '' or any(s[i+1].startswith(t) for t in endings)))
            # Middle chunks: a delimiter is required on both sides.
            or (i != 0 and i != l - 2 and any(s[i].endswith(t) for t in spaces) and any(s[i+1].startswith(t) for t in endings))
        ) for i in range(l - 1))
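For instance, on a small sample of my own (not the test file from the question), the boundary rules play out like this:

    # "Word", "word," and "word." are delimited on both sides and count;
    # the matches inside "Sword" and "words" are rejected.
    print(count_occurrences("word", "Word word, word. Sword words"))  # 3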
The whole file runs on ideone:
Ran 1 test in 0.025s
OK
That is what the question asks for.
The logic is pretty simple. Let's split the text by word, both lower-cased. Now let's look at each pair of neighbouring chunks: if, for example, the chunk at index 0 ends with a valid delimiter and the chunk at index 1 starts with a valid delimiter, let's count that as an occurrence. We do that for every pair up to the last one.
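To make that concrete, here is the split for the sample string above; each gap between two neighbouring chunks is one candidate occurrence:

    >>> "Word word, word. Sword words".lower().split("word")
    ['', ' ', ', ', '. s', ' ', 's']

The pair ('. s', ' ') is rejected because '. s' ends with 's' rather than a delimiter (that candidate was the match inside "Sword"), and (' ', 's') is rejected because 's' is not a valid ending (the match inside "words").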
Since performance is important here, we have to be aware of the order of spaces and endings. Each any() call stops at the first delimiter that matches, so we are effectively looking for the first item in the list that fulfils the condition, and it is important to put the most common delimiters first. For example, if I declare:
    spaces = ['(', '«', '\u201d', '\u201c', ':', "''", "__", '\n', ' ']
instead of what I have in my solution, I get a run time of 0.036 seconds.
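A minimal timeit sketch (the probe string "foo " and the repetition count are my own choices, not from the original test) isolates the effect: any() short-circuits, so a chunk ending in a plain space is accepted after one comparison with the first ordering and after nine with the second:

    import timeit

    common_first = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__"]
    common_last = ['(', '«', '\u201d', '\u201c', ':', "''", "__", '\n', ' ']

    for name, order in (("common first", common_first),
                        ("common last", common_last)):
        # any() stops at the first matching delimiter.
        t = timeit.timeit(lambda: any("foo ".endswith(c) for c in order),
                          number=100_000)
        print(name, round(t, 4))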
If, for example, I declare a single array containing all the delimiters:

    spaces = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__", '?', '.', '!', ',', ')', '"', '»']

and use only that for both checks, I get 0.053 seconds, more than double the 0.025 seconds of my solution.
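The same kind of probe (again with a test chunk of my own) shows why: a chunk such as '. s' matches no delimiter at all, so any() must scan the entire list before rejecting it, nine comparisons with the short list versus sixteen with the combined one:

    import timeit

    spaces_only = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__"]
    all_delims = spaces_only + ['?', '.', '!', ',', ')', '"', '»']

    for delims in (spaces_only, all_delims):
        # Rejection is the worst case: the whole list is scanned.
        t = timeit.timeit(lambda: any(". s".endswith(c) for c in delims),
                          number=100_000)
        print(len(delims), round(t, 4))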
It may well be that ordering the delimiters differently yields an even better solution.