Why do regex engines allow / automatically attempt matching at the end of the input string?
I am giving this answer just to demonstrate why a regex engine would want to allow anything to appear after the final $ anchor in the pattern. Suppose we needed to create a regex to match a string with the following rules:
- starts with three digits
- followed by one or more letters, digits, hyphens, or underscores
- ends with a letter or a digit
We could write the following pattern:
^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$
But this is a bit bulky, because we have to use two similar character classes adjacent to each other. Instead, we could write the pattern as:
^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)
or
^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$
Here, we eliminated one of the character classes, and instead used a negative lookbehind next to the $ anchor (after it in the first version, before it in the second) to assert that the final character was not an underscore or a hyphen.
Other than a lookbehind, it makes no sense to me why a regex engine would allow something to appear after the $ anchor. My point here is that a regex engine may allow a lookbehind to appear after the $, and there are cases for which it logically makes sense to do so.
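As a quick check, the three patterns can be compared in Python, whose re module happens to accept a lookbehind after $. This is only a sketch: the sample strings and the helper name verdicts are made up for illustration, and other flavors may treat a lookbehind after $ differently.

```python
import re

# The three patterns from the answer above.
PATTERNS = {
    "two classes": r"^\d{3}[A-Za-z0-9\-_]*[A-Za-z0-9]$",
    "lookbehind after $": r"^\d{3}[A-Za-z0-9\-_]+$(?<!_|-)",
    "lookbehind before $": r"^\d{3}[A-Za-z0-9\-_]+(?<!_|-)$",
}

# Hypothetical sample inputs chosen to exercise each rule.
SAMPLES = ["123abc", "123a-b_c", "123a", "123ab_", "123-", "123", "12abc"]


def verdicts(sample):
    """Return one True/False per pattern for the given sample string."""
    return [re.search(p, sample) is not None for p in PATTERNS.values()]


for s in SAMPLES:
    print(f"{s!r:12} {verdicts(s)}")
```

On these samples all three patterns should agree with each other, which is the point: the lookbehind versions express the same rule with a single character class.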
Recall several things:
- ^ and $ are zero-width assertions: ^ matches right after the logical start of the string (or after each line ending in multiline mode, the m flag in most regex implementations), and $ matches at the logical end of the string (or, in multiline mode, at the end of a line BEFORE the end-of-line character or characters).
- .* is potentially a zero-length match, that is, a match of nothing at all. The zero-length-only version would be $(?:end of line){0} (DEMO), which is useful as a comment, I guess.
- . does not match \n (unless you have the s flag) but does match the \r in Windows CRLF line endings. So $.{1} only matches Windows line endings, for example (but don't do that; use the literal \r\n instead).
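To make the first and last points concrete, here is a small Python sketch. It only shows how Python's re module behaves (where the m flag is re.M and the s flag is re.S); the sample string is invented, and other flavors may place ^ and $ differently around \r\n.

```python
import re

text = "ab\r\ncd"  # invented sample: two lines with a Windows line ending

# ^ and $ consume nothing, so every match is an empty string at a position.
line_starts = [m.span() for m in re.finditer(r"^", text, flags=re.M)]
line_ends = [m.span() for m in re.finditer(r"$", text, flags=re.M)]

# Without the s flag the dot matches \r but not \n; with it, both.
dot_default = re.findall(r".", "\r\n")
dot_with_s = re.findall(r".", "\r\n", flags=re.S)

print(line_starts)  # [(0, 0), (4, 4)]
print(line_ends)    # [(3, 3), (6, 6)]
print(dot_default)  # ['\r']
print(dot_with_s)   # ['\r', '\n']
```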
There is no particular benefit other than simple side effect cases.
- The regexes $ and .* are each useful on their own.
- The regexes ^(?a lookahead) and (?a lookbehind)$ are common and useful.
- The regexes (?a lookaround)^ and $(?a lookaround) are potentially useful.
- The regex $.* is not useful, and it is rare enough not to warrant an optimization that makes the engine stop looking in that edge case. Most regex engines do a decent job of parsing syntax (catching a missing brace or parenthesis, for example), but to reject $.* as not useful, the engine would have to analyze the meaning of that regex and distinguish it from $(something else).
- What you get will be highly dependent on the regex flavor and the status of the s and m flags.
For examples of replacements, consider the following Bash script output from some major regex flavors:
#!/bin/bash
echo "perl"
printf "123\r\n" | perl -lnE 'say if s/$.*/X/mg' | od -c
echo "sed"
printf "123\r\n" | sed -E 's/$.*/X/g' | od -c
echo "python"
printf "123\r\n" | python3 -c "import re, sys; print(re.sub(r'$.*', 'X', sys.stdin.buffer.read().decode(), flags=re.M))" | od -c
echo "awk"
printf "123\r\n" | awk '{gsub(/$.*/,"X")};1' | od -c
echo "ruby"
printf "123\r\n" | ruby -lne 's=$_.gsub(/$.*/,"X"); print s' | od -c
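Prints (note that Perl interpolates $. in the pattern as its input line number variable, so the Perl command effectively runs the regex 1* rather than $.*, which explains its scattered X's):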
perl
0000000 X X 2 X 3 X \r X \n
0000011
sed
0000000 1 2 3 \r X \n
0000006
python
0000000 1 2 3 \r X \n X \n
0000010
awk
0000000 1 2 3 \r X \n
0000006
ruby
0000000 1 2 3 X \n
0000005
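Even within one flavor, the flags change what $.* matches. A minimal Python sketch, assuming the same input as the script above (the helper name spans is made up for illustration, and the results describe Python's re module only):

```python
import re

data = "123\r\n"  # same input as the shell script above


def spans(flags=0):
    """Return the (start, end) span of every match of $.* in data."""
    return [m.span() for m in re.finditer(r"$.*", data, flags=flags)]


print(spans())      # [(4, 4), (5, 5)]
print(spans(re.M))  # [(4, 4), (5, 5)]
print(spans(re.S))  # [(4, 5), (5, 5)]
```

With the s flag the dot is allowed to consume the final \n, so the first match is no longer empty.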
What is the reason behind using .* with the global modifier on? Either someone somehow expects an empty string to be returned as a match, or they aren't aware of what the * quantifier does; otherwise the global modifier shouldn't be set. .* without g doesn't return two matches.
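Python has no g modifier, but the difference can be sketched by comparing re.search (first match only) with re.findall (all matches). The subject string here is just an example:

```python
import re

subject = "abc"

first_only = re.search(r".*", subject).group()  # like matching without g
all_matches = re.findall(r".*", subject)        # like matching with g

print(repr(first_only))  # 'abc'
print(all_matches)       # ['abc', '']
```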
"It's not obvious what the benefit of this behavior is."
There shouldn't be a benefit. You are actually questioning the existence of zero-length matches; you are asking why a zero-length string exists.
We have three valid places that a zero-length string exists:
- Start of subject string
- Between two characters
- End of subject string
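These three kinds of position can be listed with an empty pattern, which matches at every zero-length position. A small Python sketch with an arbitrary two-character subject:

```python
import re

subject = "ab"

# An empty pattern matches at every zero-length position of the subject:
# 0 is the start, 1 is between the two characters, 2 is the end.
positions = [m.start() for m in re.finditer(r"", subject)]
print(positions)  # [0, 1, 2]
```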
We should look for the reason behind, rather than the benefit of, that second zero-length match produced by .* with the g modifier (or by a function that searches for all occurrences). The zero-length position following an input string has some logical uses. The state diagram below was taken from debuggex for .*, but I added an epsilon on the direct transition from the start state to the accept state to demonstrate a definition:
(state diagram image; source: pbrd.co)
That's a zero-length match (read more about epsilon transition).
This all relates to greediness and non-greediness. Without zero-length positions a regex like .?? wouldn't have a meaning. It doesn't attempt the dot first; it skips it. For this purpose it matches a zero-length string, moving from the current state to a temporary accepting state.
Without a zero-length position, .?? could never skip a character in the input string, and that would result in a whole brand new flavor.
The definition of greediness / laziness leads to zero-length matches.
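The difference between the greedy and the lazy optional dot can be observed directly. A brief Python sketch with made-up subject strings:

```python
import re

greedy = re.match(r".?", "abc").group()   # tries the dot first
lazy = re.match(r".??", "abc").group()    # tries the empty match first

# The lazy dot is skipped when possible and used only when it has to be.
skipped = re.match(r"a.??c", "ac").group()
used = re.match(r"a.??c", "abc").group()

print(repr(greedy), repr(lazy))  # 'a' ''
print(skipped, used)             # ac abc
```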