Non-greedy match with SED regex (emulate perl's .*?)
Sed regexes match the longest match. Sed has no equivalent of non-greedy.
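As a quick illustration of the greedy behaviour, here is a minimal sketch on a sample string that contains two AB...AC pairs. The variable names are only illustrative.

```shell
# Sample input with two AB...AC pairs.
input='ssABteAstACABnnACss'

# Greedy: .* runs up to the LAST AC on the line, so both pairs are swallowed.
greedy=$(printf '%s\n' "$input" | sed 's/AB.*AC/XXX/')
printf '%s\n' "$greedy"    # prints: ssXXXss
```

A non-greedy match would stop at the first AC and leave the second pair alone, giving ssXXXABnnACss.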
What we want to do is match
1. AB,
   followed by
2. any amount of anything other than AC,
   followed by
3. AC
Unfortunately, sed can’t do #2, at least not for a multi-character regular expression. Of course, for a single-character regular expression such as @ (or even [123]), we can do [^@]* or [^123]*.
And so we can work around sed’s limitations by changing all occurrences of AC to @ and then searching for AB, followed by any number of anything other than @, followed by @, like this:
sed 's/AC/@/g; s/AB[^@]*@/XXX/; s/@/AC/g'
The last part changes unmatched instances of @ back to AC.
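To make the three substitutions easier to follow, this sketch runs them one at a time on a sample string and prints each intermediate result. The variable names are illustrative, and the input is assumed to contain no @ characters.

```shell
input='ssABteAstACABnnACss'

# Step 1: replace every end string AC with the single-character placeholder @.
step1=$(printf '%s\n' "$input" | sed 's/AC/@/g')
printf '%s\n' "$step1"    # prints: ssABteAst@ABnn@ss

# Step 2: AB, then anything that is not @, then the first @.
step2=$(printf '%s\n' "$step1" | sed 's/AB[^@]*@/XXX/')
printf '%s\n' "$step2"    # prints: ssXXXABnn@ss

# Step 3: turn the remaining placeholders back into AC.
step3=$(printf '%s\n' "$step2" | sed 's/@/AC/g')
printf '%s\n' "$step3"    # prints: ssXXXABnnACss
```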
But this is a reckless approach because the input could already contain @ characters.
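A small sketch of what goes wrong, using two made-up inputs that already contain an @:

```shell
# Same three substitutions as above, with @ as the placeholder.
workaround='s/AC/@/g; s/AB[^@]*@/XXX/; s/@/AC/g'

# Case 1: an @ outside the match is "restored" to AC although it never was one.
corrupted=$(printf '%s\n' 'a@bABxACy' | sed "$workaround")
printf '%s\n' "$corrupted"    # prints: aACbXXXy  (a@bXXXy was wanted)

# Case 2: an @ between AB and AC ends the match too early.
cut_short=$(printf '%s\n' 'ABx@yAC' | sed "$workaround")
printf '%s\n' "$cut_short"    # prints: XXXyAC  (XXX was wanted)
```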
So, by matching them, we could get false positives, and the final step would also turn any @ from the original input into AC. However, since no shell variable will ever have a NUL (\x00) character in it, NUL is likely a good character to use in the above workaround instead of @:
$ echo 'ssABteAstACABnnACss' | sed 's/AC/\x00/g; s/AB[^\x00]*\x00/XXX/; s/\x00/AC/g'
ssXXXABnnACss
The use of NUL requires GNU sed. (To make sure that GNU features are enabled, the user must not have set the environment variable POSIXLY_CORRECT.)
If you are using sed with GNU's -z flag to handle NUL-separated input, such as the output of find ... -print0, then NUL will not be in the pattern space and NUL is a good choice for the substitution here.
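A minimal sketch of the -z case, assuming GNU sed. The two NUL-terminated records are made up and stand in for output such as that of find ... -print0; tr is used only to make the result readable.

```shell
# Two NUL-terminated records; with -z, sed treats NUL as the record separator,
# so no record can contain a NUL before the first substitution runs.
records=$(printf 'ssABteAstACABnnACss\0xABaACbAC\0' |
  sed -z 's/AC/\x00/g; s/AB[^\x00]*\x00/XXX/; s/\x00/AC/g' |
  tr '\0' '\n')    # show one record per line

printf '%s\n' "$records"
```

With GNU sed this should print ssXXXABnnACss and xXXXbAC on separate lines: in each record only the first AB...AC pair is replaced.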
Although NUL cannot be in a bash variable, it is possible to include it in a printf command. If your input string can contain any character at all, including NUL, then see Stéphane Chazelas' answer, which adds a clever escaping method.
Some sed implementations have support for non-greedy matching. ssed has a PCRE mode:
ssed -R 's/AB.*?AC/XXX/'
AT&T ast sed supports the *? operator as a non-greedy version of * in its extended (with -E) and augmented (with -A) regexps:
sed -E 's/AB.*?AC/XXX/'
sed -A 's/AB.*?AC/XXX/'
In that implementation and those -E/-A modes, more generally, perl-like regexps can be used inside (?P:perl-like regexp here), though as seen above, it's not necessary for the *? operator.
Its augmented regexps also have conjunction and negation operators:
sed -A 's/AB(.*&(.*AC.*)!)AC/XXX/'
Portably, you can use this technique: replace the end string (here AC) with a single character that doesn't occur in either the beginning or end string (like : here) so you can do s/AB[^:]*://, and in case that character may appear in the input, use an escaping mechanism that doesn't clash with the begin and end strings.
An example:
sed 's/_/_u/g; # use _ as the escape character, escape it
s/:/_c/g; # escape our replacement character
s/AC/:/g; # replace the end string
s/AB[^:]*:/XXX/; # actual replacement
s/:/AC/g; # restore the remaining end strings
s/_c/:/g; # revert escaping
s/_u/_/g'
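To check that the escaping round-trips, here is a sketch that wraps the same seven substitutions in a shell function and feeds it inputs that already contain : and _ characters. The function name and the test inputs are made up for illustration.

```shell
# Hypothetical helper: the sed commands are the ones from the example above,
# passed as separate -e expressions instead of one commented script.
replace_first_ab_ac() {
  sed -e 's/_/_u/g' \
      -e 's/:/_c/g' \
      -e 's/AC/:/g' \
      -e 's/AB[^:]*:/XXX/' \
      -e 's/:/AC/g' \
      -e 's/_c/:/g' \
      -e 's/_u/_/g'
}

# Input containing both the placeholder (:) and the escape character (_).
out1=$(printf '%s\n' 'a:b_cABx:yACzAC' | replace_first_ab_ac)
printf '%s\n' "$out1"    # prints: a:b_cXXXzAC

# Input that already contains the literal sequences _u and _c.
out2=$(printf '%s\n' 'x_uy_cABqAC' | replace_first_ab_ac)
printf '%s\n' "$out2"    # prints: x_uy_cXXX
```

In both cases the characters outside the match come back unchanged, and only the first AB...AC pair is replaced.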
With GNU sed, an approach is to use newline as the replacement character. Because sed processes one line at a time, newline never occurs in the pattern space, so one can do:
sed 's/AC/\n/g;s/AB[^\n]*\n/XXX/;s/\n/AC/g'
That generally doesn't work with other sed implementations because they don't support [^\n]. With GNU sed you have to make sure that POSIX compatibility is not enabled (like with the POSIXLY_CORRECT environment variable).
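A short sketch of the newline variant on a sample string, assuming GNU sed. POSIXLY_CORRECT is unset inside the command substitution only, so the surrounding shell is not affected; the variable name is illustrative.

```shell
# Requires GNU sed: \n in the replacement and [^\n] in the bracket expression.
result=$(
  unset POSIXLY_CORRECT    # make sure GNU extensions are enabled for this sed
  printf '%s\n' 'ssABteAstACABnnACss' |
    sed 's/AC/\n/g;s/AB[^\n]*\n/XXX/;s/\n/AC/g'
)
printf '%s\n' "$result"    # prints: ssXXXABnnACss
```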
sed - non greedy matching by Christoph Sieghart
The trick to get non greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it and shell scripts should be, after all, quick and easy. So in case somebody else might need it:
Greedy matching
% echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar
Non greedy matching
% echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar