Python string occurrence count regex performance
OK, I struggled to make this work without regexes, since, as we all know, regexes are slow. Here is what I came up with:
    def count_occurrences(word, text):
        # Delimiters that may appear immediately before an occurrence...
        spaces = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__"]
        # ...and those that may appear immediately after it.
        endings = spaces + ['?', '.', '!', ',', ')', '"', '»']
        s = text.lower().split(word.lower())
        l = len(s)
        return sum((
            # First chunk: the text may begin with the word itself.
            (i == 0 and (s[0] == '' or any(s[i].endswith(t) for t in spaces)) and (s[1] == '' or any(s[i+1].startswith(t) for t in endings)))
            # Last chunk: the text may end with the word itself.
            or (i == l - 2 and any(s[i].endswith(t) for t in spaces) and (s[i+1] == '' or any(s[i+1].startswith(t) for t in endings)))
            # Middle chunks: a delimiter is required on both sides.
            or (i != 0 and i != l - 2 and any(s[i].endswith(t) for t in spaces) and any(s[i+1].startswith(t) for t in endings))
        ) for i in range(l - 1))
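For instance, on a small sample of my own (not the test file from the question), the boundary rules play out like this:

    # "Word", "word," and "word." are delimited on both sides and count;
    # the matches inside "Sword" and "words" are rejected.
    print(count_occurrences("word", "Word word, word. Sword words"))  # 3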
The whole file runs on ideone:
Ran 1 test in 0.025s
OK
That is what the question asks for.
The logic is pretty simple. Let's split the text by word, both lower-cased. Now let's look at each pair of neighbouring chunks: if, for example, the chunk at index 0 ends with a valid delimiter and the chunk at index 1 starts with a valid delimiter, let's count that as an occurrence. We do that for every pair up to the last one.
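To make that concrete, here is the split for the sample string above; each gap between two neighbouring chunks is one candidate occurrence:

    >>> "Word word, word. Sword words".lower().split("word")
    ['', ' ', ', ', '. s', ' ', 's']

The pair ('. s', ' ') is rejected because '. s' ends with 's' rather than a delimiter (that candidate was the match inside "Sword"), and (' ', 's') is rejected because 's' is not a valid ending (the match inside "words").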
Since performance is important here, we have to be aware of the order of spaces and endings. Each any() call stops at the first delimiter that matches, so we are effectively looking for the first item in the list that fulfils the condition, and it is important to put the most common delimiters first. For example, if I declare:
    spaces = ['(', '«', '\u201d', '\u201c', ':', "''", "__", '\n', ' ']
instead of what I have in my solution, I get a run time of 0.036 seconds.
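A minimal timeit sketch (the probe string "foo " and the repetition count are my own choices, not from the original test) isolates the effect: any() short-circuits, so a chunk ending in a plain space is accepted after one comparison with the first ordering and after nine with the second:

    import timeit

    common_first = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__"]
    common_last = ['(', '«', '\u201d', '\u201c', ':', "''", "__", '\n', ' ']

    for name, order in (("common first", common_first),
                        ("common last", common_last)):
        # any() stops at the first matching delimiter.
        t = timeit.timeit(lambda: any("foo ".endswith(c) for c in order),
                          number=100_000)
        print(name, round(t, 4))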
If, for example, I declare a single array containing all the delimiters:

    spaces = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__", '?', '.', '!', ',', ')', '"', '»']

and use only that for both checks, I get 0.053 seconds, more than double the 0.025 seconds of my solution.
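The same kind of probe (again with a test chunk of my own) shows why: a chunk such as '. s' matches no delimiter at all, so any() must scan the entire list before rejecting it, nine comparisons with the short list versus sixteen with the combined one:

    import timeit

    spaces_only = [' ', '\n', '(', '«', '\u201d', '\u201c', ':', "''", "__"]
    all_delims = spaces_only + ['?', '.', '!', ',', ')', '"', '»']

    for delims in (spaces_only, all_delims):
        # Rejection is the worst case: the whole list is scanned.
        t = timeit.timeit(lambda: any(". s".endswith(c) for c in delims),
                          number=100_000)
        print(len(delims), round(t, 4))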
It may well be that ordering the delimiters differently yields an even better solution.