Regex replace text but exclude when text is between specific tag
This should do the trick:
(?!<a[^>]*>)(Test)(?![^<]*</a>)
Try it yourself on regexr.
Follow-up: As Adam explains in his answer, the first part has no effect and can be dropped entirely:
(Test)(?![^<]*</a>)
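To see what the shortened pattern does, here is a minimal sketch using Python's re module. The sample string and the replacement 1234 are made up for illustration; other regex flavours with lookahead support should behave the same way, but that is an assumption.

```python
import re

# Pattern from the answer above: match "Test" unless only non-"<"
# characters separate it from a closing </a>.
PATTERN = re.compile(r"(Test)(?![^<]*</a>)")

sample = 'Test one <a href="#">Test two</a> Test three'  # made-up input
result = PATTERN.sub("1234", sample)

print(result)  # 1234 one <a href="#">Test two</a> 1234 three
```

Only the Test inside the link text is left alone, because it is the only one followed by </a> with no < in between.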
Answer
Use
(Test)(?!(.(?!<a))*</a>)
Explanation
Let me remind you of the meaning of some symbols:
1) ?! is a negative lookahead. For example, r(?!d) selects every r that is not directly followed by a d.
2) Therefore, never use a negative lookahead without a character in front of it. On its own, (?!d) consumes nothing and only matches empty positions, so it selects no text.
3) The ? can be used to make a quantifier lazy. For example, .+E applied to 123EEE selects the whole string 123EEE. However, .+?E selects as few "any character" (.+) as needed, so it only selects 123E.
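The three points above can be checked quickly in Python. This is only a sketch: the test strings rd ra r and ad are invented for the demonstration, and the variable names are hypothetical.

```python
import re

# 1) Negative lookahead: every r that is not directly followed by d.
#    In "rd ra r" the r at index 0 is skipped, the ones at 3 and 6 match.
r_positions = [m.start() for m in re.finditer(r"r(?!d)", "rd ra r")]

# 2) A lookahead on its own consumes nothing, so every match is empty.
bare_matches = re.findall(r"(?!d)", "ad")

# 3) Greedy versus lazy quantifier on the string from the answer.
greedy = re.match(r".+E", "123EEE").group()
lazy = re.match(r".+?E", "123EEE").group()

print(r_positions)   # [3, 6]
print(bare_matches)  # ['', '']
print(greedy, lazy)  # 123EEE 123E
```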
Answer:
protist's answer is that you should use (?!<a[^>]*?>)(Test)(?![^<]*?</a>). Let me explain how to make this shorter first.
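As mentioned in 2), it is meaningless to put this lookahead before the match: at the position where Test starts, the next characters are Test, never <a, so (?!<a[^>]*?>) always succeeds. So the following is equivalent to protist's answer: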
(Test)(?![^<]*?</a>)
Also, since < is excluded by [^<], the lazy quantifier ? is superfluous, so it is also equivalent to
(Test)(?![^<]*</a>)
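As a sanity check (not a proof), the three variants can be compared on a few inputs in Python. The sample strings and the helper name same_result are assumptions made for this sketch.

```python
import re

VARIANTS = [
    r"(?!<a[^>]*?>)(Test)(?![^<]*?</a>)",  # protist's answer
    r"(Test)(?![^<]*?</a>)",               # leading lookahead dropped
    r"(Test)(?![^<]*</a>)",                # lazy quantifier dropped too
]

SAMPLES = [  # made-up inputs
    'Test <a href="x">Test</a> Test',
    'Lorem Test dolor <a href="http://Test.com/url">Test <strong>dolor</strong></a> eirmod',
    "Test</a>",
    "nothing to replace here",
]

def same_result(text):
    """Return True if all variants produce the same replacement result."""
    results = {re.sub(pattern, "1234", text) for pattern in VARIANTS}
    return len(results) == 1

print(all(same_result(s) for s in SAMPLES))  # True
```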
This selects every Test that is not followed by a </a> without the symbol < in between. This is why a Test which appears before or after any <a ...> .. </a> will be replaced.
However, note that
Lorem Test dolor <a href="http://Test.com/url">Test <strong>dolor</strong></a> eirmod
would be changed to
Lorem 1234 dolor <a href="http://1234.com/url">1234 <strong>dolor</strong></a> eirmod
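The unwanted replacements inside the link can be reproduced with a short Python sketch. It uses the input line quoted above with 1234 as the replacement; the variable names are hypothetical.

```python
import re

simple = re.compile(r"(Test)(?![^<]*</a>)")

line = ('Lorem Test dolor <a href="http://Test.com/url">'
        'Test <strong>dolor</strong></a> eirmod')

changed = simple.sub("1234", line)
print(changed)
# Lorem 1234 dolor <a href="http://1234.com/url">1234 <strong>dolor</strong></a> eirmod
```

Both occurrences inside the link are followed by <strong> before the closing </a>, so the lookahead never sees </a> and they get replaced as well.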
In order to catch that you could change your regex to
(Test)(?!(.(?!<a))*</a>)
which does the following:
Select every word Test that is not followed by a string ***</a> where each character in *** is not followed by <a.
Note that the dot . is important (see 2)): without it, the inner lookahead would consume nothing.
Note that a lazy match like (Test)(?!(.(?!<a))*?</a>) is not needed, because nested links (something like <a href="#">..<a href="#">...</a>..</a>) are illegal in HTML4 and HTML5.
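Here is a small Python sketch of the stricter pattern on the same line of text, together with the lazy variant for comparison. The variable names are hypothetical, and the input is assumed to sit on a single line, since . does not match newlines by default.

```python
import re

strict = re.compile(r"(Test)(?!(.(?!<a))*</a>)")
strict_lazy = re.compile(r"(Test)(?!(.(?!<a))*?</a>)")

line = ('Lorem Test dolor <a href="http://Test.com/url">'
        'Test <strong>dolor</strong></a> eirmod')

fixed = strict.sub("1234", line)
fixed_lazy = strict_lazy.sub("1234", line)

print(fixed)
# Lorem 1234 dolor <a href="http://Test.com/url">Test <strong>dolor</strong></a> eirmod
print(fixed == fixed_lazy)  # True
```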
protist said
Also, using regexes on raw HTML is not recommended.
I agree with that. One problem is that the result is wrong if a tag is not closed or opened. For example, (Test)(?!(.(?!<a))*</a>) would change
Lorem Test dolor Test <strong>dolor</strong></a> eirmod Test dolores sea Test takimata
to
Lorem Test dolor Test <strong>dolor</strong></a> eirmod 1234 dolores sea 1234 takimata
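The effect of a missing opening tag can be reproduced with the following Python sketch. The input string is an assumption: it is the text that would produce the result shown above, with a stray </a> and no matching <a ...>.

```python
import re

strict = re.compile(r"(Test)(?!(.(?!<a))*</a>)")

# Assumed input: a closing </a> without its opening tag.
broken = ('Lorem Test dolor Test <strong>dolor</strong></a> '
          'eirmod Test dolores sea Test takimata')

broken_result = strict.sub("1234", broken)
print(broken_result)
# Lorem Test dolor Test <strong>dolor</strong></a> eirmod 1234 dolores sea 1234 takimata
```

The first two occurrences are treated as if they were inside a link, because the regex only sees the closing tag and cannot know that the link was never opened.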
(?!<a[^>]*?>)(Test)(?![^<]*?</a>)
Same as zb226's answer, but optimized with a lazy match.
Also, using regexes on raw HTML is not recommended.
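For completeness, one possible parser-based alternative is sketched below with Python's standard html.parser module. This is not part of any answer here. The class and function names are hypothetical, doctype declarations and processing instructions are not re-emitted, and end tags are written back in lowercase, so treat it as an illustration rather than a finished tool.

```python
from html.parser import HTMLParser

class OutsideLinkReplacer(HTMLParser):
    """Re-emit the HTML, replacing text only outside <a> elements."""

    def __init__(self, old, new):
        # Keep entities such as &amp; untouched so they can be re-emitted.
        super().__init__(convert_charrefs=False)
        self.old, self.new = old, new
        self.depth = 0   # number of currently open <a> elements
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.depth += 1
        self.out.append(self.get_starttag_text())  # raw tag, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "a" and self.depth > 0:
            self.depth -= 1
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        inside_link = self.depth > 0
        self.out.append(data if inside_link else data.replace(self.old, self.new))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

def replace_outside_links(html, old, new):
    parser = OutsideLinkReplacer(old, new)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(replace_outside_links(
    'Lorem Test dolor <a href="http://Test.com/url">Test <strong>dolor</strong></a> eirmod Test',
    "Test", "1234"))
# Lorem 1234 dolor <a href="http://Test.com/url">Test <strong>dolor</strong></a> eirmod 1234
```

Because attribute values never reach handle_data, the URL is left alone without any lookahead, and a stray </a> simply leaves the depth at zero.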