Why special characters like = or " break PHP regexp when using \b word boundary?
The problem is your use of \b, which is a "word boundary." It behaves roughly like (^\w|\w$|\W\w|\w\W), except that \b itself consumes no characters. Here \w is a "word" character [A-Za-z0-9_] and \W is the opposite. A " is not a "word" character, so when it sits next to another non-word character (or at the start or end of the string), the boundary condition is not met.
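As a minimal sketch of that failure, assuming a pattern with \b on both ends (the question's exact pattern is not quoted here, and the sample strings are made up):

```php
<?php
// Assumed pattern shape: \b on both sides of the keyword.
$pattern = '/\bstackoverflow=""\b/';

// The closing quote is followed by a space. Both are non-word characters,
// so there is no word boundary after the quote and the match fails.
$withSpace = preg_match($pattern, 'xxx stackoverflow="" xxx');
var_dump($withSpace); // int(0)

// The closing quote is followed by a word character, so the trailing \b holds.
$withWord = preg_match($pattern, 'xxx stackoverflow=""xxx');
var_dump($withWord); // int(1)
```

The second result is probably the opposite of what was intended, which is why \b is the wrong tool after a quote.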
Try using \s instead, which matches any whitespace character:
(?:^|\s)stackoverflow=""(?:\s|$)
Most characters inside a character class are not interpreted. The exceptions include ^, which negates the class when it is the first character, and -, which forms a range. This is why [ ^] wouldn't work for you: it was searching for a space or a literal ^.
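A small sketch of the difference, using a made-up keyword foo rather than anything from the question:

```php
<?php
// Inside a class, a caret that is not the first character is a literal "^".
$literalCaret = preg_match('/[ ^]foo/', '^foo');
var_dump($literalCaret); // int(1)

// The class still consumes one character, so it cannot match at the start of the string.
$atStart = preg_match('/[ ^]foo/', 'foo');
var_dump($atStart); // int(0)

// An alternation with a real anchor covers the start of the string.
$withAnchor = preg_match('/(?:^|\s)foo/', 'foo');
var_dump($withAnchor); // int(1)
```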
$ php -a
Interactive shell
php > $input_line='
php ' stackoverflow="" xxx
php ' xxx stackoverflow="" xxx
php ' xxx stackoverflow=""
php ' ';
php > echo preg_replace('/(?:^|\s)stackoverflow=""(?:\s|$)/', 'OK', $input_line);
OKxxx
xxxOKxxx
xxxOK
https://regex101.com/r/nP2aB8/1
Background
From the regular-expressions.info Word boundaries page:
The metacharacter \b is an anchor like the caret and the dollar sign. It matches at a position that is called a "word boundary". This match is zero-length.
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
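To see these positions concretely, the following sketch lists every offset where \b matches in a short made-up string. It relies on preg_match_all with PREG_OFFSET_CAPTURE, which reports the offset of each (empty) match:

```php
<?php
// Hypothetical sample: a(0) b(1) space(2) "(3) c(4) "(5)
$subject = 'ab "c"';

// \b is zero-length, so every match is an empty string; only the offsets matter.
preg_match_all('/\b/', $subject, $matches, PREG_OFFSET_CAPTURE);
$offsets = array_column($matches[0], 1);

echo implode(', ', $offsets), "\n"; // 0, 2, 4, 5
```

Offset 3 (between the space and the quote) and offset 6 (after the final quote) are missing because neither side is a word character there.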
A very good explanation from nhahtdh's post:
A word boundary \b is equivalent to: (?:(?<!\w)(?=\w)|(?<=\w)(?!\w))
Which means:
Right ahead, there is (at least) a character that is a word character, and right behind, we cannot find a word character (either the character is not a word character, or it is the start of the string).
OR
Right behind, there is (at least) a character that is a word character, and right ahead, we cannot find a word character (either the character is not a word character, or it is the end of the string).
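The equivalence can be checked on a sample string. This is only a sketch: the helper name zero_width_offsets and the sample input are illustrative, not part of the quoted post.

```php
<?php
// Hypothetical helper: offsets of all zero-length matches of $pattern in $subject.
function zero_width_offsets(string $pattern, string $subject): array
{
    preg_match_all($pattern, $subject, $m, PREG_OFFSET_CAPTURE);
    return array_column($m[0], 1);
}

$subject = 'xxx stackoverflow="" xxx';
$viaAnchor = zero_width_offsets('/\b/', $subject);
$viaLookarounds = zero_width_offsets('/(?:(?<!\w)(?=\w)|(?<=\w)(?!\w))/', $subject);

var_dump($viaAnchor === $viaLookarounds); // bool(true)
```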
What's wrong with your regex
\b is not suitable because whether it matches depends on the characters on both sides of it: it requires a word character on one side and a non-word character (or the start or end of the string) on the other. When you build a regex dynamically, you do not know in advance whether \B or \b is the right one to use. For your case, you could use '/\bstackoverflow=""\B/', but in general that means choosing between a word and a non-word boundary for each end of the keyword. There is an easier way: use negative lookarounds.
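For illustration, here is one way the "smart" boundary choice could look. The helper edge_anchor is hypothetical and assumes the keyword starts and ends with single-byte characters:

```php
<?php
// Hypothetical helper: \b next to a word character, \B next to a non-word one.
function edge_anchor(string $char): string
{
    return preg_match('/\w/', $char) ? '\b' : '\B';
}

$keyword = 'stackoverflow=""';
$pattern = '/' . edge_anchor($keyword[0])
    . preg_quote($keyword, '/')
    . edge_anchor(substr($keyword, -1)) . '/';
echo $pattern, "\n"; // /\bstackoverflow\=""\B/

$followedBySpace = preg_match($pattern, 'xxx stackoverflow="" xxx');
$atEnd = preg_match($pattern, 'xxx stackoverflow=""');
$followedByWord = preg_match($pattern, 'xxx stackoverflow=""xxx');
var_dump($followedBySpace, $atEnd, $followedByWord); // int(1) int(1) int(0)
```

This works, but it is extra bookkeeping that the lookaround version avoids.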
Solution
(?<!\w)stackoverflow=""(?!\w)
See regex demo
The regex contains negative lookarounds instead of word boundaries. The (?<!\w) lookbehind fails the match if there is a word character before stackoverflow="", and the (?!\w) lookahead fails the match if stackoverflow="" is followed by a word character.
What the shorthand character class \w matches depends on whether you enable the Unicode modifier /u. Without it, \w matches just [a-zA-Z0-9_]. You can impose further restrictions using the lookarounds.
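A rough sketch of how /u changes the outcome, using a made-up subject in which the keyword is glued to a non-ASCII letter:

```php
<?php
// "\xC3\xA9" is the UTF-8 byte sequence for "é"; the subject is a hypothetical sample.
$subject = "\xC3\xA9stackoverflow=\"\"";

// With /u, "é" counts as a word character, so the lookbehind rejects the match.
$withU = preg_match('/(?<!\w)stackoverflow=""(?!\w)/u', $subject);
var_dump($withU); // int(0)

// Without /u the subject is read byte by byte. Whether the byte before "s"
// counts as a word character depends on the locale, so no result is stated here.
$withoutU = preg_match('/(?<!\w)stackoverflow=""(?!\w)/', $subject);
var_dump($withoutU);
```

If your input may contain non-ASCII letters, choosing the modifier deliberately matters for where the keyword is allowed to match.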
Demo
PHP demo:
$re = '/(?<!\w)stackoverflow=""(?!\w)/';
$str = ",stackoverflow=\"\" xxx\nxxx stackoverflow=\"\" xxx\nxxx stackoverflow=\"\"\nstackoverflow=\"\" xxx";
echo preg_replace($re, "NEW=\"\"", $str);
NOTE: If you pass your string as a variable, remember to escape all special characters in it with preg_quote:
$re = '/(?<!\w)' . preg_quote($keyword, '/') . '(?!\w)/';
Here, notice the second argument to preg_quote, which is /, the regex delimiter character. Passing it makes preg_quote escape any / inside the keyword as well.
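A short sketch with a made-up keyword that contains both regex metacharacters and the delimiter shows what preg_quote produces:

```php
<?php
// Hypothetical keyword containing ".", "/" and "=".
$keyword = 'a.b/c=""';
$re = '/(?<!\w)' . preg_quote($keyword, '/') . '(?!\w)/';
echo $re, "\n"; // /(?<!\w)a\.b\/c\=""(?!\w)/

// The escaped dot only matches a literal dot.
$literal = preg_match($re, 'x a.b/c="" y');
$otherChar = preg_match($re, 'x aXb/c="" y');
var_dump($literal, $otherChar); // int(1) int(0)
```

Without the second argument the / in the keyword would end the pattern early and preg_match would emit a warning.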
"
is, of course, not special. The word boundary \b, on the other hand, is. It looks for the beginning or end of a word, so it expects a word character on one side of the boundary, and the quote is not such a character. Remove \b from the end of the pattern, or replace it with a negative lookahead for a word character.
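A minimal sketch of the second option, keeping the leading \b and swapping the trailing one for a negative lookahead (the input lines are made up):

```php
<?php
$pattern = '/\bstackoverflow=""(?!\w)/';
$input = "stackoverflow=\"\" xxx\nxxx stackoverflow=\"\"\nxxx stackoverflow=\"\"xxx";

$result = preg_replace($pattern, 'OK', $input);
echo $result, "\n";
// OK xxx
// xxx OK
// xxx stackoverflow=""xxx
```

The last line is left alone because the keyword is immediately followed by a word character.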