Dealing with Repeated and Pattern at the same time

One solution would be to embed the second StringCases expression that parses the name/value pairs within the replacement rule of the first StringCases expression that extracts the elements:

$text = "some text <content1 a=1 b=22> some other <content2 c=3 d=45 e=5> rubbish";

patt = Shortest[Except[" " | "="] ..];

StringCases[$text
, ("<" ~~ head : patt ~~ cont : ((" " ~~ patt ~~ "=" ~~ patt) ..) ~~ ">") :>
  {head, StringCases[cont, " " ~~ name : patt ~~ "=" ~~ value : patt :> (name -> value)]}
]

(* {{content1, {a->1, b->2}}, {content2, {c->3, d->4, e->5}}} *)

Note that patt does not correctly extract the values (for example, b->2 should be b->22). Perhaps we should consider something like this:

StringCases[$text
, "<" ~~
   head : WordCharacter.. ~~
   cont : ((Whitespace ~~ WordCharacter.. ~~ "=" ~~ WordCharacter..) ..) ~~
   ">" :>
    { head
    , StringCases[cont
      , Whitespace ~~ name:WordCharacter.. ~~ "=" ~~ value:WordCharacter.. :>
        (name -> value)
      ]
    }
]

(* {{content1, {a->1, b->22}}, {content2, {c->3, d->45, e->5}}} *)

If the toy text is not representative of the actual data that will be processed, then the string patterns might need to be revised even further.

Is a single use of StringCases possible?

If we wish to parse the name-value pairs in a single pass through StringCases, we need to come up with something effectively like...

StringCases[$text
, (  "<"
  ~~ head:WordCharacter..
  ~~ Repeated[Whitespace~~name1:WordCharacter.. ~~ "=" ~~ value1:WordCharacter.., {0, 1}]
  ~~ Repeated[Whitespace~~name2:WordCharacter.. ~~ "=" ~~ value2:WordCharacter.., {0, 1}]
  ~~ Repeated[Whitespace~~name3:WordCharacter.. ~~ "=" ~~ value3:WordCharacter.., {0, 1}]
  ~~ Repeated[Whitespace~~name4:WordCharacter.. ~~ "=" ~~ value4:WordCharacter.., {0, 1}]
  ~~ Repeated[Whitespace~~name5:WordCharacter.. ~~ "=" ~~ value5:WordCharacter.., {0, 1}]
  ~~ ">"
  ) :>
  { head
  , {name1 -> value1, name2 -> value2, name3 -> value3, name4 -> value5} //
      DeleteCases["" -> ""]
  }
]

(* {{content1, {a->1, b->22}}, {content2, {c->3, d->45, e->5}}} *)

... except without having an upper limit on the number of pairs, and without it being so horribly verbose.

Mathematica presents numerous obstacles to this goal, most of which are noted in the question:

  • If a pattern is named within a repeating group, then all members of the group are constrained so that the named portions match. In the case at hand, this forces all the names within an element's name/value pairs to be the same (and similarly, all values the same).
  • If a pattern is not named within a repeating group, there is no way to access its value except by pattern test (even for regular expressions since the PCRE engine only remembers the last value for a repeated capture group).
  • But if a pattern test is used, it is called for each matched character individually instead of for the whole matched string.
  • Even if we use a pattern test, we cannot rely upon the order in which characters are supplied to it since the pattern-matcher may back-track and present us with any given character multiple times.

Given these considerations, the goal is not achievable in general.

However...

In this particular case, if the supplied string is well-formed, then the pattern matcher need never back-track. So this opens a tiny window through which we can squeeze a heinous hack:

Module[{name, value, rules}
, StringCases[
  $text
  , (  "<"
    ~~ (head:WordCharacter.. /; (rules = {}; True))
    ~~ (Whitespace /; (name = ""; True))
    ~~ (  WordCharacter.. ? ((name = StringJoin[name, #]; True)&)
       ~~ ("=" /; (value = ""; True))
       ~~ WordCharacter.. ? ((value = StringJoin[value, #]; True)&)
       ~~ (Whitespace|">") ? ((AppendTo[rules, name -> value]; name = ""; True)&)
       )..
    ) :> { head, rules}
  ]
]

Most of the subpatterns in this expression are guarded by either a condition or a pattern test. These will collectively be referred to as "tests". Each test is unconventional in that it incurs a side-effect before unconditionally returning true. These side-effects are used to gather up the matched parts and slowly assemble them into the final result. In particular:

  • As each "head" is matched, a list holding its set of rules is set to be empty.
  • When the following whitespace is matched, the name variable is set to the empty string.
  • As each name character is matched, it is appended to name variable.
  • When the equal sign is matched, the value variable is set to the empty string.
  • As each value character is matched, it is appended to the value variable.
  • When the space or greater-than character is reached at the end of a value, a rule of the form name -> value is appended to the rule list and the name variable is reset to blank ready for the next pair (if any).
  • Finally, when the entire element has been matched, the result is a list of the head along with its rules.

This whole process is crucially dependent upon there being no back-tracking within the pattern matcher. It may work in this particular case, but this strategy is not applicable to more complex patterns or input strings that could be considered ill-formed.

For my part, I consider this hack to be too tricky. I would recommend just accepting the fact that portions of the input string will need to be scanned at least twice in order to fully parse them. If such rescanning is impractical for the real application then I think a custom parser should be implemented.

Parse as HTML?

As a completely different strategy, we could observe that the input string very much resembles HTML and process it as such:

Cases[ImportString[$text, {"HTML", "XMLObject"}]
, XMLElement[tag_, attrs_, ___] :> {tag, attrs}
, {4, Infinity}
]

(* {{content1, {a->1, b->22}}, {content2, {c->3, d->45, e->5}}} *)

The level range {4, Infinity} was carefully chosen to ignore the html and body elements that the parser wraps around any HTML fragments.


Here is a crude attempt at a PatternTest for phrases (sequences of characters). It is rather hacky and probably extremely fragile for the reasons given by WReach.

The syntax is phrasetest[pat, func]

The idea is to supply an unconditionally true PatternTest to the pattern pat, which assembles the individual characters into a string as a side effect. Once the pattern provisionally matches, the assembled string is tested by func in a Condition. If the test returns False the string is reset to empty and the Condition fails, otherwise the Condition succeeds and the pattern matches. An additional trivial PatternTest resets the string for next time.

phrasetest[pat_, func_] := Module[{s = ""},
  ((pat?((s = s <> #; True) &)) /; func[s] || (s = ""))?((s = ""; True) &)]

A simple example - the pattern is BlankSequence[] and the test is for a string length of 3:

pt = phrasetest[__, StringLength[#] == 3 &];

StringCases["123abc456", x : pt :> StringReverse[x]]
(* {"321", "cba", "654"} *)

With multiple phrasetests you could pick out the various parts of the string and collect them with Sow. In this example I've used Echo instead:

patt = Except@Characters[" =<>"] ..;

pt0 = phrasetest[patt, (Echo[#, "head"]; True) &];    
pt1 = phrasetest[patt, (Echo[#, "variable"]; True) &];    
pt2 = phrasetest[patt, (Echo[#, "value"]; True) &];

StringCases["some text <content1 a=31 b=2> some other <content2 c=3 d=34 e=2> rubbish",
 ("<" ~~ pt0 ~~ (" " ~~ pt1 ~~ "=" ~~ pt2) .. ~~ ">")]

enter image description here

Of course the result from StringCases is wrong, you would need to use Sow/Reap and do further processing on the output. It doesn't fulfil the desire to do everything inside a single StringCases, but it does show a sort of pattern test for substrings.