Should I use \\\" or \" to match " in RegularExpression?
Short Version: Use ".*\".*"
to match an embedded quote, ".*\\\\.*"
to match an embedded backslash.
This question deals with two distinct syntaxes -- Mathematica string syntax and regular expression syntax. Both syntaxes use \
as an escape character, so we need to separate the two levels to see what is happening.
First, let's deal with the Mathematica syntax. In Mathematica strings, both \
and "
must be escaped by a preceding backslash. Mathematica interprets the exhibited strings as follows:
Mathematica Syntax    String Content
".*\".*"              .*".*
".*\\\".*"            .*\".*
".*\\.*"              .*\.*
".*\\\\.*"            .*\\.*
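One way to check this mapping is to let Mathematica parse the four literals and report what they contain. The following is a minimal sketch; the variable name strings is only an illustrative choice.

```mathematica
(* Each literal is written in Mathematica string syntax. *)
(* After parsing, the escaping layer is gone.            *)
strings = {".*\".*", ".*\\\".*", ".*\\.*", ".*\\\\.*"};

(* Character counts of the parsed string content *)
StringLength /@ strings   (* {5, 6, 5, 6} *)

(* Print shows the content without the surrounding quotes or escapes *)
Scan[Print, strings]
```

The lengths 5, 6, 5, 6 match the String Content column: each escaped pair in the source text counts as a single character in the string.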
Having stripped away the Mathematica layer, the questions are now reduced to...
Are the regular expressions .*".*
and .*\".*
equivalent?
Yes.
Mathematica uses PCRE to implement its regular expressions. The key point concerns the meaning of \"
. Quoting from the PCRE manual page:
[...] only ASCII numbers and letters have any special meaning after a backslash. All other characters [...] are treated as literals.
Since "
has no special regular expression meaning, \"
is equivalent to "
. Therefore .*".*
and .*\".*
are equivalent expressions. The shorter version is more common (Mathematica syntax: ".*\".*"
).
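The equivalence can be spot-checked with StringMatchQ. This is a small sketch; the names $q1 and $q2 and the test strings are illustrative, and the results in the comments are what the equivalence argued above predicts.

```mathematica
$q1 = RegularExpression[".*\".*"];     (* regular expression: .*".*  *)
$q2 = RegularExpression[".*\\\".*"];   (* regular expression: .*\".* *)

(* A string that contains a double quote *)
StringMatchQ["say \"hi\"", $q1]   (* True *)
StringMatchQ["say \"hi\"", $q2]   (* True *)

(* A string without any double quote *)
StringMatchQ["no quote", $q1]     (* False *)
StringMatchQ["no quote", $q2]     (* False *)
```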
Are the regular expressions .*\.*
and .*\\.*
equivalent?
No!
The same PCRE backslash rule applies here. \.
is interpreted as a literal .
and \\
is interpreted as a literal backslash. So, the first regular expression, .*\.*
, means zero or more characters, followed by zero or more literal periods. The second regular expression, .*\\.*
, means zero or more characters, followed by a literal backslash, followed by zero or more characters. The two expressions have different meanings!
The following examples show that the two expressions are not equivalent (take heed of the switch back to Mathematica string syntax):
$r1 = RegularExpression[".*\\.*"];
StringMatchQ["abc", $r1] (* True *)
StringMatchQ["abc...", $r1] (* True *)
StringMatchQ["abc\\def", $r1] (* True *)
$r2 = RegularExpression[".*\\\\.*"];
StringMatchQ["abc", $r2] (* False *)
StringMatchQ["abc...", $r2] (* False *)
StringMatchQ["abc\\def", $r2] (* True *)
For the most part, the first regular expression can be simplified to .*
. The second, however, is already in its simplest form, assuming that one wants to match a string that contains a backslash.
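As a rough check of the simplification, the first regular expression and a plain .* can be compared on the same sample strings used earlier. This is only a sketch; the names $r0 and samples are illustrative, and a handful of test strings does not prove equivalence in general.

```mathematica
$r1 = RegularExpression[".*\\.*"];   (* regular expression: .*\.* *)
$r0 = RegularExpression[".*"];       (* regular expression: .*    *)

samples = {"abc", "abc...", "abc\\def"};

(* StringMatchQ threads over a list of strings *)
StringMatchQ[samples, $r1]   (* {True, True, True} *)
StringMatchQ[samples, $r0]   (* {True, True, True} *)
```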