Does RegularExpression support "(?R)"?

This response is really just an extended comment in support of the hypothesis that Mathematica is anchoring the pattern to the beginning and end of the string. In addition to the compelling behavioural observations reported in @2012rcampion's response, I submit the evidence of the following expressions:

ClearSystemCache[]
Trace[StringMatchQ["a", a_], TraceInternal->True]

trace output

We can see near the beginning of the trace output that the pattern is being anchored to StartOfString and EndOfString:

StringPattern`PatternConvert[StartOfString ~~ a_ ~~ EndOfString, None, 2]

Further down, we see the regular expression (?ms)\A(.)\z, again anchored to the string boundaries.

Unfortunately, this particular spelunking technique will not reveal the inner workings of RegularExpression[...] conversion. But it stands to reason that such conversions will follow the apparent anchoring policy exhibited in the simpler example.

We can see the lack of anchoring in contexts that would demand its absence, e.g.

StringCases["123AABCBAA123", RegularExpression["(.)((?R)|.?)\\1"]]
(* {"AABCBAA"} *)

StringReplace["123AABCBAA123", RegularExpression["(.)((?R)|.?)\\1"] -> "!!!"]
(* "123!!!123" *)

But even that last example is fragile. Consider what happens when we attempt to name the matched expression:

StringReplace["123AABCBAA123", r:RegularExpression["(.)((?R)|.?)\\1"] :> "("~~r~~")"]
(* "123AABCBAA123" *)

Mathematica has quietly introduced a capturing group that precedes ours, so \\1 is no longer referring to what we expect. This is readily confirmed by changing \\1 to \\2:

StringReplace["123AABCBAA123", r:RegularExpression["(.)((?R)|.?)\\2"] :> "("~~r~~")"]
(* "123(AABCBAA)123" *)

So, there is a moral to this story: assume that Mathematica is going to add elements to the regular expression. Thus, (?R) is not fully reliable since it matches the whole pattern, and the "whole pattern" is out of our control. We must content ourselves with using explicitly referenced subpatterns (e.g. (?1), (?2) etc). And even then, we must be wary of any implicit capturing groups or PCRE callbacks that Mathematica might add.

UPDATE - Named Patterns

As @2012rcampion points out in a comment, it is probably safest to use named patterns instead of numbered patterns to avoid conflicting with implicit capturing groups. PCRE in Mathematica appears to be configured so that only (?P) syntax is supported:

  • (?P<n>...) defines the pattern named n which matches ...
  • (?P=n) is a back-reference to the latest match for the pattern named n
  • (?P>n) is a reference to the pattern named n itself (e.g. for recursion)

If we adjust the pattern from the question to use named subpatterns, we get this:

DictionaryLookup[RegularExpression["(?P<a>(?P<b>.)((?P>a)|.?)(?P=b))"]]
(*
   {"aha", "aka", "bib", "bob", "boob", "bub", "CFC", "civic", "dad", "deed",
    "deified", "did", "dud", "DVD", "eke", "ere", "eve", "ewe", "eye", "gag",
    "gig", "huh", "kayak", "kook", "level", "ma'am", "madam", "mam", "MGM",
    "minim", "mom", "mum", "nan", "non", "noon", "nun", "oho", "pap", "peep",
    "pep", "pip", "poop", "pop", "pup", "radar", "redder", "refer", "repaper",
    "reviver", "rotor", "sagas", "sees", "seres", "sexes", "shahs", "sis",
    "solos", "SOS", "stats", "stets", "tat", "tenet", "TNT", "toot", "tot",
    "tut", "wow", "WWW"}
*)

This pattern:

  1. names the entire pattern a
  2. captures a character and calls it b
  3. either recurses into the entire pattern a or matches 0..1 character
  4. matches the character captured as b

This dodges the implicit string anchoring, and is relatively robust with respect to implicit captures.

UPDATE 2 - Expanded backreference syntax in v10.1+

As Alexey Popkov notes in a comment, later versions of Mathematica support new styles of backreference and subroutine reference. From version 10.1 onward, we can write our expression in Python, Perl or Oniguruma styles:

{ "(?P<a>(?P<b>.)((?P>a)|.?)(?P=b))"   (* Python, as above *)
, "(?<a>(?<b>.)((?&a)|.?)(\\k<b>))"    (* Perl *)
, "(?'a'(?'b'.)((?&a)|.?)(\\k'b'))"    (* Perl *)
, "(?<a>(?<b>.)((\\g<a>)|.?)(\\g{b}))" (* Oniguruma *)
, "(?<a>(?<b>.)((\\g'a')|.?)(\\g{b}))" (* Oniguruma *)
} // Map[DictionaryLookup @* RegularExpression] // SameQ

(* True *)

I realized I should probably double-check that the regular expression was correct... it worked on regex101, but I wanted to test it on my computer, just in case. Running the command:

grep -P '(.)((?R)|.?)\1' /usr/share/dict/words

resulted in matching every word that contained a palindrome inside it (e.g. Abbott -> bb). Of course, simply adding ^$ to match the beginning and end would't work, since the recursive match would also need to match the beginning and end (which never occurs), so we would end up with something equivalent to ^(.).?\1$ (seem familiar yet?). Also, it seems like you can't use -w with -P.

This, however, works fine:

grep -P '^((.)((?1)|.?)\2)$' /usr/share/dict/words

By changing the recursive part (now (?1)) to only recurse on the (new) first capturing group, we avoid trying to re-match the beginning and end (while \1 has been demoted to \2).

Then I went back to Mathematica and tried this:

DictionaryLookup[RegularExpression["((.)((?1)|.?)\\2)"]]

and it worked perfectly. So it seems to me that Mathematica prepends and appends ^ and $ to a regular expression used in a string pattern.


Still looking at it, but (just in case) I think you don't need that pattern to detect palindromes:

DictionaryLookup[RegularExpression["^(?:(.)(?=.*(\\1(?(2)\\2|))$))*.?\\2?$"]]
(*
{"a", "aha", "aka", "bib", "bob", "boob", "bub", "CFC", "civic", \
"dad", "deed", "deified", "did", "dud", "DVD", "eke", "ere", "eve", \
"ewe", "eye", "gag", "gig", "huh", "I", "kayak", "kook", "level", \
"ma'am", "madam", "mam", "MGM", "minim", "mom", "mum", "nan", "non", \
"noon", "nun", "oho", "pap", "peep", "pep", "pip", "poop", "pop", \
"pup", "radar", "redder", "refer", "repaper", "reviver", "rotor", \
"sagas", "sees", "seres", "sexes", "shahs", "sis", "solos", "SOS", \
"stats", "stets", "tat", "tenet", "TNT", "toot", "tot", "tut", "wow", \
"WWW"}
*)