SED: multiple patterns on the same line, how to match/parse first one

If a number is defined as the digits that follow NUM:, this extracts all of them:

sed -n -e 's/$/\n/' -e ':begin' \
  -e 's/\(NUM:[0-9][0-9]*\)\(.*\)\n\(.*\)/\2\n\3 \1/' \
  -e 'tbegin' -e 's/.*\n //' -e '/NUM/p'

What this does is:

1. Append a \n to the end of the line to act as a marker.
2. Try to find a number before the marker and move it to the end of the line (after the marker).
3. If a number was found, go back to step 2.
4. When no numbers are left before the marker, remove everything up to and including the marker.
5. If a number is on the line, print it (this handles lines where no numbers are found).
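To see the loop in action, here is a minimal sketch that wraps the same script in a function and runs it on the sample line from the question. The function name extract_nums is only an illustration, and the script assumes GNU sed, which accepts \n in the replacement text.

```shell
line='bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla'

# Hypothetical wrapper around the script above (GNU sed assumed).
extract_nums() {
  sed -n -e 's/$/\n/' -e ':begin' \
    -e 's/\(NUM:[0-9][0-9]*\)\(.*\)\n\(.*\)/\2\n\3 \1/' \
    -e 'tbegin' -e 's/.*\n //' -e '/NUM/p'
}

# Pass 1 moves NUM:09011111111 behind the marker, pass 2 moves NUM:08022222222,
# pass 3 finds nothing before the marker, so the loop ends.
printf '%s\n' "$line" | extract_nums
# prints: NUM:09011111111 NUM:08022222222

# A line without numbers prints nothing.
printf '%s\n' 'bla bla bla' | extract_nums
```

Because the pattern has no leading .*, each pass picks the leftmost remaining number, which is why the numbers come out in their original order.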

It can also be done the other way around, first dropping lines without numbers:

sed -e '/NUM/!d' -e 's/$/\n/' -e ':begin' \
  -e 's/\(NUM:[0-9][0-9]*\)\(.*\)\n\(.*\)/\2\n\3 \1/' \
  -e 'tbegin' -e 's/.*\n //'

This might work for you:

echo "bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla" |
sed 's/NUM:/\n&/g;s/[^\n]*\n\(NUM:[0-9]*\)[^\n]*/\1 /g;s/.$//'
NUM:09011111111 NUM:08022222222

The problem is that .* is greedy: it matches the longest possible string, not the shortest, so it runs up to the last NUM: on the line rather than the first. The fix is to place a unique character in front of each string we're interested in (NUM:...). \n works as that character because sed uses it as the line delimiter, so it cannot already be in the line. Deleting every run of characters that are not the marker ([^\n]*), followed by the marker itself (\n), then effectively splits the string into manageable pieces.
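A small side-by-side sketch of the two behaviours, using the sample line from the question. It assumes GNU sed, which treats \n inside a bracket expression as a newline.

```shell
line='bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla'

# Greedy: .* runs to the LAST "NUM:", so the first number is swallowed.
printf '%s\n' "$line" | sed 's/.*NUM://'
# prints: 08022222222 bla bla bla

# Marker trick: after a \n is put before every "NUM:", [^\n]* can never
# cross into the next number, so each piece is handled on its own.
printf '%s\n' "$line" |
  sed 's/NUM:/\n&/g;s/[^\n]*\n\(NUM:[0-9]*\)[^\n]*/\1 /g;s/.$//'
# prints: NUM:09011111111 NUM:08022222222
```

The final s/.$// only removes the trailing space left behind by the last replacement.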

As you know by now, sed regexes are greedy and as far as I can tell can't be made non-greedy.

Two alternatives that haven't been brought up so far both use other tools for this kind of matching/extraction.

You can use perl as a drop-in replacement for sed with the -pe parameters. It supports the ? non-greedy modifier:

$ perl -pe 's/.*?NUM://' data.txt
09011111111 bla bla bla bla NUM:08022222222 bla bla bla
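If only the first number itself is wanted, sed can get there without a non-greedy quantifier, because a pattern with no leading .* already matches at the leftmost position. This is just a sketch combining that with the \n marker idea from above; the function name first_num is made up for the example and GNU sed is assumed.

```shell
line='bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla'

# Hypothetical helper: print only the digits of the first NUM: on each line.
first_num() {
  # 1st command: replace the first "NUM:<digits>" and everything after it
  #              with a \n marker followed by the digits.
  # 2nd command: drop everything up to the marker and print the result;
  #              lines without a number never get a marker, so they are skipped.
  sed -n 's/NUM:\([0-9][0-9]*\).*/\n\1/;s/.*\n//p'
}

printf '%s\n' "$line" | first_num
# prints: 09011111111
```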

You can use the -o option to GNU grep to get only the bits of your data that match the regex:

$ grep -Eo 'NUM:[0-9]*' data.txt
NUM:09011111111
NUM:08022222222
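Since -o prints each match on its own line, picking the first one is a matter of taking the first output line. A rough sketch, assuming data.txt holds the sample line from the question (a temporary file stands in for it here):

```shell
# Temporary stand-in for data.txt, assumed to contain the sample line.
tmp=$(mktemp)
printf '%s\n' 'bla bla bla NUM:09011111111 bla bla bla bla NUM:08022222222 bla bla bla' > "$tmp"

# Every match, one per output line.
all=$(grep -o 'NUM:[0-9]*' "$tmp")
printf '%s\n' "$all"

# Only the first match in the whole input.
first=$(grep -o 'NUM:[0-9]*' "$tmp" | head -n 1)
printf '%s\n' "$first"
# prints: NUM:09011111111

rm -f "$tmp"
```

Note that head -n 1 gives the first match in the whole file, not the first match of every line.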
