Efficiently querying one string against multiple regexes

I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.

I was lucky in that my regular expressions usually had some substring that must appear in every string it matches. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithms. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.

I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.


This is the way lexers work.

The regular expressions are converted into a single non deterministic automata (NFA) and possibily transformed in a deterministic automata (DFA).

The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.

There are many tools that can help you here, they are called "lexer generator" and there are solutions that work with most of the languages.

You don't say which language are you using. For C programmers I would suggest to have a look at the re2c tool. Of course the traditional (f)lex is always an option.