How to reduce the greediness of a regular expression in AWK?

If you want to match @ and everything up to and including the first comma after it, you need to write the pattern as @[^,]*,

That is @ followed by any number (*) of non-commas ([^,]) followed by a comma (,).
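To make the difference concrete, here is a small sketch comparing the greedy pattern with the negated-class one. The sample line is made up for illustration.

```shell
# Made-up sample input.
line='user@example.com, alice, bob'

# Greedy: .* runs on to the LAST comma on the line.
greedy=$(printf '%s\n' "$line" | awk '{ sub(/@.*,/, ""); print }')

# Negated class: [^,]* cannot cross a comma, so the match ends at the first one.
first=$(printf '%s\n' "$line" | awk '{ sub(/@[^,]*,/, ""); print }')

printf '%s\n' "$greedy"   # -> user bob
printf '%s\n' "$first"    # -> user alice, bob
```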

That approach works as the equivalent of @.*?, (with the comma as the terminator). It does not work for things like @.*?string, where the terminator is more than a single character: negating a character is easy, but negating a string in a regexp is a lot more difficult.

A different approach is to pre-process your input to replace or prepend the string with a character that otherwise doesn't occur in your input:

gsub(/string/, "\1&") # pre-process
gsub(/@[^\1]*\1string/, "")
gsub(/\1/, "") # revert the pre-processing
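As a rough end-to-end sketch of those three calls, assuming the input never contains the \1 byte (the sample line is invented):

```shell
# Made-up input with two occurrences of the delimiter "string".
line='A@BstringCstringD'

result=$(printf '%s\n' "$line" | awk '{
    gsub(/string/, "\1&")          # put a marker byte before each "string"
    gsub(/@[^\1]*\1string/, "")    # delete from @ up to the first marked "string"
    gsub(/\1/, "")                 # remove the markers that are left
    print
}')

printf '%s\n' "$result"   # -> ACstringD
```

A plain greedy gsub(/@.*string/, "") on the same line would leave AD instead.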

If you can't guarantee that the input won't contain your replacement character (\1 above), one approach is to use an escaping mechanism:

gsub(/\1/, "\1\3") # use \1 as the escape character and escape itself as \1\3
                   # in case it's present in the input
gsub(/\2/, "\1\4") # use \2 as our marker character and escape it
                   # as \1\4 in case it's present in the input
gsub(/string/, "\2&") # mark the "string" occurrences

gsub(/@[^\2]*\2string/, "")

# then roll back the marking and escaping
gsub(/\2/, "")
gsub(/\1\4/, "\2")
gsub(/\1\3/, "\1")
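Here is a sketch of the whole round trip on an input that already contains both control bytes, which is exactly the case the escaping is meant to survive. The input is made up, and the final comparison is only there to show that the literal \1 and \2 bytes come back unchanged.

```shell
# Made-up input that already contains a literal \002 and a literal \001 byte.
input=$(printf 'A@BstringC\002stringD\001E')

result=$(printf '%s\n' "$input" | awk '{
    gsub(/\1/, "\1\3")             # escape the escape byte itself
    gsub(/\2/, "\1\4")             # escape any literal marker byte
    gsub(/string/, "\2&")          # mark each occurrence of "string"
    gsub(/@[^\2]*\2string/, "")    # delete up to the first marked "string"
    gsub(/\2/, "")                 # drop the remaining markers
    gsub(/\1\4/, "\2")             # restore literal marker bytes
    gsub(/\1\3/, "\1")             # restore literal escape bytes
    print
}')

# Only "@Bstring" should be gone; both control bytes should still be there.
expected=$(printf 'AC\002stringD\001E')
[ "$result" = "$expected" ] && echo "control bytes survived the round trip"
```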

That works for fixed strings, but not for arbitrary regexps such as the equivalent of @.*?foo.bar.

There are already several good answers providing work-arounds for awk's inability to do non-greedy matches, so I'm providing some information on an alternative way to do it using Perl Compatible Regular Expressions (PCRE). Note that most simple "match and print" awk scripts can easily be re-implemented in perl using the -n command-line option, and more complex scripts can be converted with the a2p Awk to Perl translator.

Perl has a non-greedy operator, which can be used in Perl scripts and in anything that uses PCRE, for example GNU grep with its -P option.

PCRE is not identical to Perl's regular expressions, but it is very close. It is a popular choice of a regular expression library for many programs, because it's very fast, and the Perl enhancements to extended regular expressions are very useful.

From the perlre(1) man page:

   By default, a quantified subpattern is "greedy", that is, it will match
   as many times as possible (given a particular starting location) while
   still allowing the rest of the pattern to match.  If you want it to
   match the minimum number of times possible, follow the quantifier with
   a "?".  Note that the meanings don't change, just the "greediness":

       *?        Match 0 or more times, not greedily
       +?        Match 1 or more times, not greedily
       ??        Match 0 or 1 time, not greedily
       {n}?      Match exactly n times, not greedily (redundant)
       {n,}?     Match at least n times, not greedily
       {n,m}?    Match at least n but not more than m times, not greedily

This is an old post, but the following information might be useful for others.

There is a way, admittedly crude, to perform non-greedy RE matching in awk. The basic idea is to use the match(string, RE) function, and progressively reduce the size of the string until the match fails, something like (untested):

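if (match(string, RE)) {
    rstart = RSTART   # save it: the match() calls below overwrite RSTART
    # Shorten the string from the right until RE no longer matches at rstart
    for (i = RLENGTH; i >= 1; i--)
        if (!match(substr(string, 1, rstart + i - 1), RE) || RSTART != rstart)
            break
    # At this point, the non-greedy match will start at rstart
    #  for a length of i+1 (assuming RE cannot match the empty string)
}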
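Wrapped up so it can be tried from a shell, the idea might look like the sketch below. The helper name shortest_match is made up, the RE is assumed not to match the empty string, and the result is the shortest match at the leftmost position, which is not necessarily what Perl's per-quantifier non-greediness would give for more complicated patterns.

```shell
# Hypothetical helper: print the shortest leftmost match of ERE in STRING.
# usage: shortest_match STRING ERE
shortest_match() {
    awk -v s="$1" -v re="$2" 'BEGIN {
        if (!match(s, re)) exit 1      # no match at all
        start = RSTART
        # Shrink from the right while re still matches at the same start.
        for (i = RLENGTH; i >= 1; i--)
            if (!match(substr(s, 1, start + i - 1), re) || RSTART != start)
                break
        print substr(s, start, i + 1)
    }'
}

shortest_match 'A@BstringCstringD' '@.*string'   # -> @Bstring
```

The extra RSTART test guards against the shortened string matching somewhere further right, which can happen with alternations.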

