Perl: Matching string not containing PATTERN
jippie, first, here's a tip. If you see a regex that is not immediately obvious to you, you can dump it in a tool that explains every token.
For instance, here is the RegexBuddy output:
"
^ # Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed)
(?: # Match the regular expression below
(?! # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
PATTERN # Match the character string “PATTERN” literally (case insensitive)
)
. # Match any single character that is NOT a line break character (line feed)
)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ # Assert position at the end of a line (at the end of the string or before a line break character) (line feed)
# Perl 5.18 allows a zero-length match at the position where the previous match ends.
# Perl 5.18 attempts the next match at the same position as the previous match if it was zero-length and may find a non-zero-length match at the same position.
"
Some people also use regex101.
A Human Explanation
Now if I had to explain the regex, I would not be so linear. I would start by saying that it is fully anchored by the ^ and the $, implying that the only possible match is the whole string, not a substring of that string.
Then we come to the meat: a non-capturing group introduced by (?: and repeated any number of times by the *.
What does this group do? It contains
- a negative lookahead (you may want to read up on lookarounds) asserting that at this exact position in the string, we cannot match the word PATTERN,
- then a dot to match the next character.
This means that at each position in the string, we assert that we cannot match PATTERN, then we match the next character.
If PATTERN can be matched anywhere, the negative lookahead fails, and so does the entire regex.
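To see that behaviour concretely, here is a minimal sketch in Perl. The variable name and the sample strings are made up for illustration, and PATTERN is simply treated as a literal word:

```perl
use strict;
use warnings;

# The regex under discussion; PATTERN is treated as a literal word here.
my $no_pattern = qr/^(?:(?!PATTERN).)*$/;

# Hypothetical sample strings, chosen only for illustration.
for my $str ('hello world', 'has PATTERN inside', '') {
    my $verdict = $str =~ $no_pattern ? 'matches' : 'does not match';
    print "'$str' $verdict\n";
}
```

The first and the empty string match; the second one does not, because the lookahead fails at the position where PATTERN starts and the group cannot step past it.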
Building it up piece by piece (and throughout assuming no newlines in the string or PATTERN):
This matches any string:
/^.*$/
But we don't want the . to match a character that starts PATTERN, so replace
.
with
(?!PATTERN).
This uses a negative look-ahead that tests a given pattern without actually consuming any of the string and only succeeds if the pattern does not match at the given point in the string. So it's like saying:
if PATTERN doesn't match at this point,
match the next character
This needs to be done for every character in the string, so * is used to match zero or more times, from the beginning to the end of the string.
To make the * apply to the combination of the negative look-ahead and the ., not just the ., it needs to be surrounded by parentheses, and since there's no reason to capture, they should be non-capturing parentheses (?: ):
(?:(?!PATTERN).)*
And putting back the anchors to make sure we test at every position in the string:
/^(?:(?!PATTERN).)*$/
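Why the grouping matters can be shown with a small comparison. This is only a sketch: the variable names and test strings are invented, and bar stands in for PATTERN.

```perl
use strict;
use warnings;

# Lookahead outside the repetition: tested once, at position 0 only.
my $once  = qr/^(?!bar).*$/;
# Lookahead inside the group: tested before every single character.
my $every = qr/^(?:(?!bar).)*$/;

# Hypothetical sample strings, chosen only for illustration.
for my $str ('bar none', 'crowbar', 'plain') {
    printf "%-12s once=%-8s every=%s\n", "'$str'",
        ($str =~ $once  ? 'match' : 'no match'),
        ($str =~ $every ? 'match' : 'no match');
}
```

The interesting case is 'crowbar': the ungrouped version accepts it because bar does not occur at the very start, while the grouped version rejects it.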
Note that this solution is particularly useful as part of a larger match; e.g. to match any string with foo and later baz but no bar in between:
/foo(?:(?!bar).)*baz/
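A quick way to try this out, as a sketch with invented sample strings (the extra capturing parentheses are added here only so the matched text can be printed):

```perl
use strict;
use warnings;

# Same regex as above, wrapped in a capture group for display purposes.
my $foo_baz = qr/(foo(?:(?!bar).)*baz)/;

# Hypothetical sample strings, chosen only for illustration.
for my $str ('foo then baz', 'foo bar baz', 'foo bar foo baz') {
    if ($str =~ $foo_baz) {
        print "'$str' matched: '$1'\n";
    } else {
        print "'$str' did not match\n";
    }
}
```

Because this regex is not anchored, the third string still matches: the engine gives up on the first foo and succeeds starting from the second one.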
If there aren't such considerations, you can simply do:
/^(?!.*PATTERN)/
to check that PATTERN does not match anywhere in the string.
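Under the same no-newlines assumption, the short form and the character-by-character form should agree. A minimal sketch to check that on a few invented strings (the variable names are hypothetical):

```perl
use strict;
use warnings;

# Single lookahead at the start of the string.
my $short    = qr/^(?!.*PATTERN)/;
# Character-by-character version built up earlier.
my $tempered = qr/^(?:(?!PATTERN).)*$/;

# Hypothetical sample strings without newlines.
for my $str ('abc', 'xxPATTERNxx', 'PATTERN', '') {
    my $a = $str =~ $short    ? 1 : 0;
    my $b = $str =~ $tempered ? 1 : 0;
    print "'$str': short=$a tempered=$b\n";
}
```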
About newlines: there are two problems with your regex and newlines.
First, . doesn't match newlines, so "foo\nbar" =~ /^(?:(?!baz).)*$/ doesn't match, even though the string does not contain baz. You need to add the /s flag to make . match any character; "foo\nbar" =~ /^(?:(?!baz).)*$/s correctly matches.
Second, $ doesn't match only at the end of the string; it can also match before a newline at the end of the string. So "foo\n" =~ /^(?:(?!\s).)*$/s does match, even though the string contains whitespace and you are attempting to match only strings with no whitespace. \z matches only at the very end, so "foo\n" =~ /^(?:(?!\s).)*\z/s correctly fails to match the string that does in fact contain a \s.
So the correct general-purpose regex is:
/^(?:(?!PATTERN).)*\z/s
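The four newline cases from the paragraph above can be run side by side. This is just a sketch that wraps them in a table; the labels and the @cases layout are made up for illustration:

```perl
use strict;
use warnings;

# Each entry: label, subject string, regex. The cases are the ones from the text.
my @cases = (
    [ 'no /s, embedded newline', "foo\nbar", qr/^(?:(?!baz).)*$/  ],
    [ '/s, embedded newline',    "foo\nbar", qr/^(?:(?!baz).)*$/s ],
    [ '$, trailing newline',     "foo\n",    qr/^(?:(?!\s).)*$/s  ],
    [ '\z, trailing newline',    "foo\n",    qr/^(?:(?!\s).)*\z/s ],
);

for my $case (@cases) {
    my ($label, $str, $re) = @$case;
    printf "%-26s %s\n", $label, ($str =~ $re ? 'match' : 'no match');
}
```

Only the second and third cases match, and the third is the unwanted one: $ accepts the position before the trailing newline, which is what \z rules out.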