Fuzzy Regular Expressions

Regexes have specific rules, they aren't built for doing what you want. It's going to be much easier to make two passes at it. Use a regex to strip off the numbers and then use a module to get your match close.

Something like this (assuming your input is lines from a file)

while( my $line = <$fh> ) {
    chomp $line;

    # do we have digits?
    if( $line =~ /^\d+/ ) {
         # removes spaces and digits from the beginning of the line
         $line =~ s/^[\d\s]*//g;

         # use your module to determine if you have a match in the remaining text.
         if( module_match ) {
             # do something
         }
         else {
             #no match
         }
    }
    else {
        # no match
    }
}

If you have one pattern you want to find the best match against a text collection you can try q-gram distance. It is quite easy to implement and adopt to special needs.

Your second description actually was helpful here, because the pattern and texts should be fairly long. q-gram distance does not work well with words like "York", but if your typical pattern is a whole address, that should be fine.

Try it like this:

transform your texts and patterns into a reduced character set, like uppercase-only, stripping, wordifying (one space between words) all symbols replaced by "#" or something.
choose a q-gram length, to work with. Try 3 or 2. We call this q=3.
than, build a qgram-profile of each text:
split each text into q-words, ie. NEW_YORK becomes [NEW, EW_, W_Y, _YO, ORK], store this away with each text.
if you search for your pattern then, you do the same with your pattern,
loop through your text-qgram-database and
- count for each pattern/text-pair how many qgrams are the same.
- each hit will raise the score by 1.
the texts with the highest score(s) are your best hits.

If you did that you can tweak this algorithm by:

prepend all you texts (and also the pattern before search), with q-1 special chars, so even your short words will get a decent profile. For example New York becomes ^^NEW YORK$$.
You can even play around with replacing all consonants with "x" and vowels with "o" and so on. Play around with a couple of character classes this way, or even create super symbols by replacing groups of character by one other, i.e. CK becomes K, or SCH becomes $.
when raising the score by a qgram-hit you can adjust the value of 1 by other things, like length-difference of text vs pattern.
store 2-grams and 3-grams both, and when counting, weigh then differently.

Note that this algorithm in the here described basic form does not have a good running time during search, i.e. O(|T|*|P|) (with |T| and |P| the total lengths of your text and pattern). This is because I described that you loop over all your texts, and then over your pattern. Therefore this is only practical for a medium-sized texts-base. If you spend some thought, you can create an advanced index structure over the q-grams (maybe using hashtables), so this might be practical for huge texts-bases as well.

Fuzzy Regular Expressions

Tags:

String

Regex

Perl

Fuzzy Search

Fuzzy Comparison

Tre Library

Related

Recent Posts