awk repetition {n} is not working
EREs (extended regular expressions, as used by awk or egrep) initially didn't have {x,y}. It was first introduced in BREs (as used by grep or sed), but with the \{x,y\} syntax, which didn't break backward portability.
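As a quick illustration of the BRE spelling (a minimal sketch, assuming a POSIX-conforming grep and sed; the sample input and variable names are made up for the example), the escaped braces select lines of exactly four characters:

```shell
# BRE interval: the braces must be escaped as \{n\}.
input=$(printf 'abcd\nabc\nabcde\n')

# grep uses BREs by default.
grep_out=$(printf '%s\n' "$input" | grep '^.\{4\}$')

# sed also uses BREs; -n plus the p command prints only matching lines.
sed_out=$(printf '%s\n' "$input" | sed -n '/^.\{4\}$/p')

printf 'grep: %s\nsed:  %s\n' "$grep_out" "$sed_out"
```

Both commands should print only the four-character line.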
But when it was added to EREs with the unescaped {x,y} syntax, it did break backward portability, as an RE like foo{2} previously matched something different (the literal string foo{2}).
So some implementations chose not to do it. You'll find that /bin/awk, /bin/nawk and /bin/egrep on Solaris still don't honour it (you need to use /usr/xpg4/bin/awk or /usr/xpg4/bin/grep -E). Same for awk and nawk on FreeBSD (based on the awk maintained by Brian Kernighan, the k in awk).
For GNU awk, until relatively recently (version 4.0), you had to call it with POSIXLY_CORRECT=anything awk '/^.{4}$/' for it to honour it. mawk still doesn't honour it.
Note that this operator is only syntactic sugar. .{3,5} can always be written ....?.? for instance (though of course {3,5} is a lot more legible, and the equivalent of (foo.{5,9}bar){123,456} would be a lot worse).
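To make the expansion concrete, here is a small sketch (the sample input and the variable name are invented for the example) that spells the interval out by hand. It should behave the same in any awk, whether or not it honours {x,y}:

```shell
# .{3,5} written without braces: three mandatory dots, then two optional ones.
expanded=$(printf 'ab\nabc\nabcd\nabcde\nabcdef\n' | awk '/^....?.?$/')

# Only the lines of 3, 4 or 5 characters are kept.
printf '%s\n' "$expanded"
```

The two-character and six-character lines are rejected, which is what .{3,5} anchored with ^ and $ would do in an awk that supports intervals.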
According to The GNU Awk User's Guide: Feature History, support for regular expression range operators was added in version 3.0 but initially required an explicit command-line option:
- New command-line options:
  - The --lint-old option to warn about constructs that are not available in the original Version 7 Unix version of awk (see V7/SVR3.1).
  - The -m option from BWK awk. (Brian was still at Bell Laboratories at the time.) This was later removed from both his awk and from gawk.
  - The --re-interval option to provide interval expressions in regexps (see Regexp Operators).
  - The --traditional option was added as a better name for --compat (see Options).
In gawk 4.0, "Interval expressions became part of default regular expressions".
Since you are using gawk 3.x, you will need to use
awk --re-interval '/^.{4}$/'
or
awk --posix '/^.{4}$/'
or (thanks @StéphaneChazelas) if you want a solution that is portable, use
POSIXLY_CORRECT=anything awk '/^.{4}$/'
(since --posix or --re-interval would cause an error in other awk implementations).
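If a script has to run under whichever awk happens to be installed, another option is to probe for interval support first and fall back to a brace-free test. The following is only a sketch: awk_has_intervals is a hypothetical helper name, and the length() fallback is equivalent only for this particular "exactly four characters" pattern.

```shell
# Hypothetical helper: succeeds if the awk on PATH treats {n} as an interval.
# An awk without interval support sees a{2} as literal text and does not match.
awk_has_intervals() {
  printf 'aa\n' | awk '/^a{2}$/ { ok = 1 } END { exit !ok }' 2>/dev/null
}

if awk_has_intervals; then
  result=$(printf 'abcd\nabc\nabcde\n' | awk '/^.{4}$/')
else
  # Fallback that avoids braces entirely.
  result=$(printf 'abcd\nabc\nabcde\n' | awk 'length($0) == 4')
fi

printf '%s\n' "$result"
```

Either branch should print abcd, so the caller does not need to know which awk it ended up with.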
This works as expected with GNU awk (gawk):
$ printf 'abcd\nabc\nabcde\n' | gawk '/^.{4}$/'
abcd
But it fails with mawk, which is closer to POSIX awk and, AFAIK, is the default on Ubuntu systems:
$ printf 'abcd\nabc\nabcde\n' | mawk '/^.{4}$/'
$ ## prints nothing
So, a simple solution would be to use gawk instead of awk. The unescaped {n} notation isn't part of the POSIX BRE (basic regular expression) syntax, where intervals are written \{n\}. That's why grep also fails here:
$ printf 'abcd\nabc\nabcde\n' | grep '^.{4}$'
$
However, it is part of ERE (extended regular expressions):
$ printf 'abcd\nabc\nabcde\n' | grep -E '^.{4}$'
abcd
mawk and similar implementations use an older version of ERE, according to Stéphane's answer. In any case, either you are using a version of awk that doesn't implement ERE intervals, or your input doesn't actually have any lines with exactly 4 characters. This could happen because of whitespace that you don't see or Unicode glyphs, for example.
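To check for invisible characters, something like the following may help. It is a sketch with a made-up sample in which the first line ends in a carriage return (as in a file with DOS line endings); the variable names are illustrative only.

```shell
# A line that looks like "abcd" but ends in a carriage return.
sample=$(printf 'abcd\r\nabc\n')

# length() reveals the extra character: the first line is longer than it looks.
lengths=$(printf '%s\n' "$sample" | awk '{ print length($0) }')
printf '%s\n' "$lengths"

# od -c shows exactly which bytes are present (look for \r, tabs, spaces).
printf '%s\n' "$sample" | od -c

# Strip a trailing carriage return before matching. Four plain dots are used
# instead of .{4} so the check does not depend on interval support.
cleaned=$(printf '%s\n' "$sample" | awk '{ sub(/\r$/, "") } /^....$/')
printf '%s\n' "$cleaned"
```

If length() reports more characters than you can see, the regex is probably fine and the input is the problem.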