finding needle in haystack, what is a better solution?

I don't think it's possible to get below O(n) here, because you need to iterate through the string at least once. You can still do some optimizations.

I assume you want to match "whole words", for example looking up foo should match like this:

foo and foo, or foobar and not foo.
^^^     ^^^                    ^^^

So splitting on spaces alone wouldn't do the job, because:

>>> 'foo and foo, or foobar and not foo.'.split(' ')
['foo', 'and', 'foo,', 'or', 'foobar', 'and', 'not', 'foo.']
#                  ^                                     ^

This is where the re module comes in handy, which allows you to build fascinating conditions. For example, \b inside a regexp means:

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.
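
As a quick sanity check, the examples from that description can be run directly. This is only a minimal sketch; the names matching and non_matching are mine, not from the documentation:

```python
import re

pattern = re.compile(r'\bfoo\b')

# Strings the documentation says should / should not match.
matching = ['foo', 'foo.', '(foo)', 'bar foo baz']
non_matching = ['foobar', 'foo3']

for text in matching + non_matching:
    # search() returns a match object or None; bool() turns that into True/False.
    print(repr(text), bool(pattern.search(text)))
```

The first four lines should print True and the last two False, since in 'foobar' and 'foo3' the character after foo is still a word character.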

So r'\bfoo\b' will match only the whole word foo. Also, don't forget to use re.escape(), in case the needle contains characters that have a special meaning in a regexp:

>>> import re
>>> re.escape('foo.bar+')
'foo\\.bar\\+'
>>> r'\b{}\b'.format(re.escape('foo.bar+'))
'\\bfoo\\.bar\\+\\b'
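
One caveat worth checking: \b needs a word character on one side, so when the needle itself ends (or starts) with a non-word character such as +, the trailing \b no longer matches in front of a space. A possible workaround, which is my own assumption and not part of the approach above, is to swap \b for the lookarounds (?<!\w) and (?!\w). The helper name whole_word_pattern is hypothetical:

```python
import re

def whole_word_pattern(needle):
    # Hypothetical helper: uses lookarounds instead of \b, so a needle that
    # starts or ends with a non-word character (like '+') can still match.
    return re.compile(r'(?<!\w){}(?!\w)'.format(re.escape(needle)))

needle = 'foo.bar+'
boundary = re.compile(r'\b{}\b'.format(re.escape(needle)))
lookaround = whole_word_pattern(needle)

# '+' followed by a space: no word boundary there, so \b fails.
print(bool(boundary.search('use foo.bar+ here')))    # False
print(bool(lookaround.search('use foo.bar+ here')))  # True

# '+' followed by a letter: \b matches here, which is probably not wanted.
print(bool(boundary.search('foo.bar+x')))            # True
print(bool(lookaround.search('foo.bar+x')))          # False
```

For plain alphanumeric needles like foo both patterns behave the same, so this only matters if the needle can contain punctuation at its edges.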

All you have to do now is use re.finditer() to scan the string. According to the documentation:

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.

I assume that matches are generated on the fly, so they never all have to be in memory at once (which may come in handy with large strings containing many matches). In the end, just count them:

>>> r = re.compile(r'\bfoo\b')
>>> it = r.finditer('foo and foo, or foobar and not foo.')
>>> sum(1 for _ in it)
3
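
Wrapped up as a function, the steps above might look like this. It is a sketch under the whole-word assumption stated earlier; count_whole_words is a name I made up for illustration:

```python
import re

def count_whole_words(needle, haystack):
    # Escape the needle so characters like '.' or '+' are matched literally,
    # then require a word boundary on both sides.
    pattern = re.compile(r'\b{}\b'.format(re.escape(needle)))
    # finditer() yields match objects one at a time; count them without
    # building a list of all matches.
    return sum(1 for _ in pattern.finditer(haystack))

print(count_whole_words('foo', 'foo and foo, or foobar and not foo.'))  # 3
```

This is still a single pass over the haystack, so it stays at O(n) for a fixed needle.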

This does not address the complexity issue but simplifies the code:

def find_needle(n, h):
    # Count whitespace-separated tokens in haystack h that equal needle n exactly.
    return h.split().count(n)
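
Note that split() only separates on whitespace, so with the sample sentence used earlier this version treats tokens like 'foo,' and 'foo.' as different words. A small sketch of the difference, with a hypothetical variant (find_needle_stripped, my own name) that strips surrounding ASCII punctuation from each token first:

```python
import string

def find_needle(n, h):
    # Exact comparison against whitespace-separated tokens.
    return h.split().count(n)

def find_needle_stripped(n, h):
    # Hypothetical variant: drop leading/trailing ASCII punctuation from each
    # token before comparing, so 'foo,' and 'foo.' count as 'foo'.
    return sum(1 for word in h.split() if word.strip(string.punctuation) == n)

sentence = 'foo and foo, or foobar and not foo.'
print(find_needle('foo', sentence))           # 1
print(find_needle_stripped('foo', sentence))  # 3
```

Whether the stripped count or the exact count is the right one depends on what should count as a match in the first place.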
