How to use [\w]+ in regular expression in sed?

Different tools and versions thereof support different variants of regular expressions. The documentation of each will tell you what they support.

Standards exist so that one can rely on a minimum set of features that are available across all conforming applications.

For instance, all modern implementations of sed and grep implement basic regular expressions (BRE) as specified by POSIX. They may follow different editions of the standard, but the standard has changed very little in that regard over the last few decades.

In POSIX BRE and ERE, you have the [:alnum:] character class, used inside a bracket expression. It matches letters and digits in your locale (note that this often includes a lot more than a-zA-Z0-9 unless the locale is C).

So:

grep -x '[[:alnum:]_]\{1,\}'

matches one or more alnums or _.
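Since the question is about sed, here is a minimal sketch of the same bracket expression in a substitution. The sample input is made up for illustration, and LC_ALL=C is set only to make the result independent of the locale. The second command assumes a sed that accepts -E for extended regular expressions, as GNU and BSD sed do.

```shell
# POSIX BRE: \{1,\} means "one or more"; the class is spelled out instead of \w.
bre=$(printf '%s\n' 'foo_bar baz' | LC_ALL=C sed 's/[[:alnum:]_]\{1,\}/X/')

# ERE spelling of the same pattern, applied globally (assumes sed supports -E).
ere=$(printf '%s\n' 'foo_bar baz' | LC_ALL=C sed -E 's/[[:alnum:]_]+/X/g')

printf '%s\n' "$bre" "$ere"
```

The first command replaces only the first word, giving "X baz"; the second replaces every word, giving "X X".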

[\w] is required by POSIX to match either a backslash or a w, because a backslash has no special meaning inside a bracket expression. So you won't find a conforming grep or sed implementation where [\w] matches word characters (unless via non-standard options).
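A quick way to see this for yourself is sketched below. The input string is an arbitrary example; with a conforming sed, only the backslash and the w are replaced, while the other letters and the underscore are left alone.

```shell
# Inside brackets the backslash is literal, so [\w] is the set { \ , w }.
result=$(printf '%s\n' 'a\w_b' | sed 's/[\w]/X/g')
echo "$result"
```

This prints aXX_b.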

The behaviour for \w alone is not specified by POSIX, so implementations are allowed to do what they want. GNU grep added that a long time ago.

GNU grep used to have its own regexp engine; it now uses the one from the GNU libc (though it embeds its own copy).

In GNU grep, \w is meant to match alnums and underscore in your locale. However, it currently has a bug in that it only matches single-byte characters (for instance, it does not match é in a UTF-8 locale even though that's clearly a letter, and even though it does match é in all the locales where é is a single-byte character).

There is also a \w regexp operator in Perl regular expressions and in PCRE. Perl and PCRE regular expressions are not POSIX regular expressions; they are another thing altogether.

Now, with the way GNU grep -P uses PCRE, it's got the same issue as without -P. It can be worked around there though by using (*UCP) (though that also has side-effects in non-UTF8 locales).

GNU sed also uses the GNU libc's regex engine for its own regexps. It uses it in such a way, though, that it doesn't have the same bug as GNU grep.
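For comparison, a minimal sketch of \w used on its own, outside brackets. This assumes GNU sed: both \w and \+ are GNU extensions to BRE and may not work with other sed implementations. The input is a made-up example.

```shell
# Assumes GNU sed: \w and \+ are GNU extensions, not POSIX.
result=$(printf '%s\n' 'here we go' | sed 's/\w\+/gone/')
echo "$result"
```

With GNU sed this replaces the first word only, printing "gone we go".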

GNU sed doesn't support PCREs. There's some evidence in the code that it has been attempted before, but it doesn't seem to be on the agenda anymore.

If you want Perl's regular expressions, just use perl though.

Otherwise, I'd say that rather than trying to rely on a bogus non-standard feature of your particular implementation of sed/grep, it would be better to stick with the standard and use [_[:alnum:]].

You are correct: \w is part of PCRE (Perl-compatible regular expressions). It's not part of the 'standard' POSIX regex syntax though. http://www.regular-expressions.info/posix.html

Some versions of sed may support it, but I'd suggest the easiest way is to just use perl in sed mode by specifying the -p flag along with -e (more detail in perlrun).

But you don't need [] around it in that example. Brackets are for listing a set of characters to match, and \w on its own already stands for one.

echo here  | perl -pe 's/\w+/gone/'

Or on Windows:

C:\>echo here  | perl -pe "s/\w+/gone/"
gone
C:\>echo here  | perl -pe "s/[\w\/]+/gone/"
gone

See perlre for more on Perl regular expression syntax.

You can get perl here: http://www.activestate.com/activeperl/downloads
