Finding if a string contains any string in a collection
It is possible to speed it up significantly with Aho-Corasick algorithm.
You can build an Aho-Corasick automaton for a collection using O(total length of all strings in a collections) time and space. Then it will be possible to check if one of the strings in a collection is a substring of a given string S in O(S.length) time by traversing this automaton.
// Make a regex pattern (once only):
StringBuilder pattern = new StringBuilder();
for (String sought : searchCollection) {
if (!Util.isNullOrEmpty(sought)) {
if (pattern.length() != 0) {
pattern.append('|');
}
pattern.append(Pattern.quote(sought));
}
}
final Pattern PATTERN = Pattern.compile("(" + pattern + ")");
This creates a pattern of alternatives like "(abc|def|ghi)"
. You might consider a case-insensitive search.
And in the function containsAny
:
Matcher m = PATTERN.matcher(searchString);
return m.find();
Regex compilation is relatively smart. It would be comparable to using a search tree of your collection of sought words: "agent" and "agitator" to ("ag", ("ent", "itator"))