How to use [\w]+ in regular expression in sed?
Different tools and versions thereof support different variants of regular expressions. The documentation of each will tell you what they support.
Standards exist so that one can rely on a minimum set of features that are available across all conforming applications.
For instance, all modern implementations of sed and grep implement basic regular expressions as specified by POSIX (one version or another of the standard, though the standard has not evolved much in that regard in the last few decades).
In POSIX BRE and ERE, you have the [:alnum:] character class, which is used inside a bracket expression, as in [[:alnum:]]. It matches letters and digits in your locale (note that this often includes a lot more than a-zA-Z0-9 unless the locale is C).
So:
grep -x '[[:alnum:]_]\{1,\}'
matches lines that consist entirely of one or more alnums or _ (the -x option anchors the match to the whole line).
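As a quick check of the command above, here is a minimal sketch that feeds it a few made-up sample lines. The sample strings and the variable name matches are only for illustration, and LC_ALL=C is forced so that [:alnum:] is limited to ASCII letters and digits:

```shell
# Feed four sample lines (the last one empty) to the grep command from above.
# LC_ALL=C keeps [:alnum:] restricted to a-zA-Z0-9 for a predictable result.
matches=$(printf '%s\n' 'foo_bar1' 'two words' 'dash-ed' '' |
  LC_ALL=C grep -x '[[:alnum:]_]\{1,\}')

# Only the line made up entirely of alnums and underscores is kept.
printf '%s\n' "$matches"
```

The lines containing a space or a hyphen are rejected, and so is the empty line, since \{1,\} requires at least one character.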
[\w] is required by POSIX to match either backslash or w. So you won't find a grep or sed implementation where that's available (unless via non-standard options).
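To see what [\w] actually does, here is a small sketch with made-up input lines (the variable name hits is just for illustration). Inside a bracket expression the backslash is an ordinary character, so only lines containing a backslash or a w should be selected:

```shell
# Three sample lines: one with a backslash, one with w, one with neither.
# printf '%s\n' prints its arguments literally, so 'a\b' keeps its backslash.
hits=$(printf '%s\n' 'a\b' 'wow' 'xyz_123' | LC_ALL=C grep '[\w]')

# Lines with a backslash or a w are printed; the word-like 'xyz_123' is not.
printf '%s\n' "$hits"
```

If [\w] meant "word character", xyz_123 would have matched as well.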
The behaviour for \w alone is not specified by POSIX, so implementations are allowed to do what they want. GNU grep added that a long time ago.
GNU grep used to have its own regexp engine; however, it now uses the GNU libc's one (though it does embed its own copy).
\w is meant to match alnums and underscore in your locale. However, GNU grep currently has a bug in that it only matches single-byte characters (for instance, not é in a UTF-8 locale, even though that's clearly a letter, and even though it does match é in all the locales where é is a single-byte character).
There is also a \w regexp operator in perl regexps and in PCRE. PCRE/perl regexps are not POSIX regular expressions, they're another thing altogether.
Now, with the way GNU grep -P uses PCRE, it's got the same issue as without -P. It can be worked around there though by using (*UCP) (though that also has side-effects in non-UTF8 locales).
GNU sed also uses the GNU libc's regex engine for its own regexps. It uses it in such a way though that it doesn't have the same bug as GNU grep.
GNU sed doesn't support PCREs. There's some evidence in the code that it has been attempted before, but it doesn't seem to be on the agenda anymore.
If you want Perl's regular expressions, just use perl though.
Otherwise, I'd say that rather than trying to rely on a bogus non-standard feature of your particular implementation of sed/grep, it would be better to stick with the standard and use [_[:alnum:]].
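Applied to the sed substitution from the question, that could look like the sketch below. The input string and the variable name result are made up for illustration, \{1,\} is the BRE spelling of +, and LC_ALL=C is forced so the result does not depend on the locale:

```shell
# Replace the first run of "word" characters using only standard BRE syntax.
result=$(printf '%s\n' 'foo_bar baz' |
  LC_ALL=C sed 's/[_[:alnum:]]\{1,\}/gone/')

printf '%s\n' "$result"   # prints: gone baz
```

Add the g flag to the s command if every word on the line should be replaced rather than just the first.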
You are correct - \w is part of PCRE (Perl Compatible Regular Expressions). It's not part of the 'standard' regex though: http://www.regular-expressions.info/posix.html
Some versions of sed may support it, but I'd suggest the easiest way is to just use perl in sed mode by specifying the -p flag (along with -e). More detail is in perlrun.
But you don't need [] around it in that example - brackets are for listing a set of allowed characters.
echo here | perl -pe 's/\w+/gone/'
Or on Windows:
C:\>echo here | perl -pe "s/\w+/gone/"
gone
C:\>echo here | perl -pe "s/[\w\/]+/gone/"
gone
See perlre for more on Perl regular expressions.
You can get perl here: http://www.activestate.com/activeperl/downloads