StringReplace with multiple patterns

Short Version

Order matters when specifying replacement rules. At each position in the string, rules are tried from left to right, and the first rule that matches consumes as much of the string as it can. Later rules only get a chance at positions where the earlier rules fail.

Patterns like ___ are very broad: they match any sequence of characters, including the empty one. More narrowly focused patterns are often a better fit (e.g. Whitespace or Except[LetterCharacter]).

Details

For discussion, let us use the following definition to shorten forms like Join[..., translationPatterns]:

$patterns = Sequence["e" -> ".", "t" -> "-"];
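As a quick check of why Sequence is used here: a Sequence is spliced into whatever list encloses it, so {$patterns, rule} ends up as an ordinary flat list of three rules. A minimal illustration (the "x" -> "y" rule is only a placeholder, not part of the original problem):

```wolfram
$patterns = Sequence["e" -> ".", "t" -> "-"];

(* The Sequence is spliced in place, giving one flat list of rules *)
{$patterns, "x" -> "y"}

(* {"e" -> ".", "t" -> "-", "x" -> "y"} *)
```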

We will now take on the cases one by one.

Case #1

StringReplace["eeeeee ttt ee", {$patterns, ___ ~~ "ttt" ~~ ___ :> "abc"}]

(* "......abc" *)

Order matters when specifying replacement rules. The Morse rules in $patterns are tried first, in order. Thus, the leading e characters are all matched by the "e" -> "." rule. When the space is reached, neither the "e" nor the "t" rule applies, so the "ttt" rule is tried. Its leading ___ matches the space, and the literal ttt follows. But the final ___ then matches all remaining characters, including the trailing e characters that would otherwise have been matched by the other rules. So everything after the leading run of e is replaced by abc.
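One way to see exactly what the broad rule swallowed is to name the match and echo it back between brackets. This is only a diagnostic sketch: the name m and the bracket markers are my own additions, and the rules are otherwise those of Case #1.

```wolfram
(* Same rules as Case #1, but the broad rule echoes the text it consumed *)
StringReplace["eeeeee ttt ee",
  {"e" -> ".", "t" -> "-", m : (___ ~~ "ttt" ~~ ___) :> "[" <> m <> "]"}]

(* "......[ ttt ee]" *)
```

The bracketed part shows that the space, the ttt and the trailing ee were all consumed by a single application of the last rule.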

Case #2

StringReplace["eeeeee ttt ee", {$patterns, __ ~~ "ttt" ~~ _ :> "abc"}]

(* "......abc.." *)

This case starts off the same as the preceding case: the leading e characters are replaced, the space is matched, and the literal ttt is matched. But this time, the next pattern element is simply _. This matches exactly one character, a space, and that is the end of the rule. So this time only " ttt " is replaced by "abc". Matching then continues, with all rules once again applied in left-to-right order. The remaining e characters are thus all replaced by dots.
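Because _ demands exactly one character, this rule silently stops applying when nothing follows ttt. A small sketch with a shortened input string (my own variation, not from the question) shows the difference between _ and ___ in the final position:

```wolfram
(* Nothing follows "ttt", so the final _ cannot match and the rule never fires *)
StringReplace["eeeeee ttt", {"e" -> ".", "t" -> "-", __ ~~ "ttt" ~~ _ :> "abc"}]
(* "...... ---" *)

(* With ___ (zero or more characters) in place of _, the rule applies again *)
StringReplace["eeeeee ttt", {"e" -> ".", "t" -> "-", __ ~~ "ttt" ~~ ___ :> "abc"}]
(* "......abc" *)
```

In the first call the space is left alone and each t falls through to the "t" -> "-" rule.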

Case #3

StringReplace["eeeeee ttt ee", {___ ~~ "ttt" ~~ ___ :> "abc", $patterns}]

(* "abc" *)

Here, we have reversed the order of the rules so that the special case for "ttt" is applied first. The first rule will match any sequence of characters followed by ttt followed by any sequence of characters. That is, it matches the whole string. Therefore the whole string is replaced.
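The claim that the first rule covers the entire string can be checked directly with StringMatchQ, which only returns True when the pattern matches the whole string:

```wolfram
(* The broad pattern matches the entire input, not just a part of it *)
StringMatchQ["eeeeee ttt ee", ___ ~~ "ttt" ~~ ___]

(* True *)
```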

Other Alternatives?

I am not sure what result is sought, so here are some alternatives that may prove to be useful.

Shortest

By default, variable-length patterns like ___ will match as many characters as possible. If we wrap Shortest[...] around such patterns, then they will match as few characters as possible instead:

StringReplace["eeeeee ttt ee", {Shortest[___ ~~ "ttt" ~~ ___] :> "abc", $patterns}]

(* "abc .." *)

Notice how the trailing ___ now matched zero characters, the shortest possible. The leading ___ still matched more than zero characters because that was the only way to ensure the match on the literal ttt.
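To look at the matched text itself rather than the replaced result, StringCases can be used with the same pattern. A small comparison of the greedy and Shortest forms:

```wolfram
(* Greedy: both ___ grab as much as they can *)
StringCases["eeeeee ttt ee", ___ ~~ "ttt" ~~ ___]
(* {"eeeeee ttt ee"} *)

(* Shortest: the match stops as soon as "ttt" has been found *)
StringCases["eeeeee ttt ee", Shortest[___ ~~ "ttt" ~~ ___]]
(* {"eeeeee ttt"} *)
```

The second result ends right after ttt, which is why the following space survives in the replaced string.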

Unfortunately, this pattern leaves a leftover space character in the string which may not be desirable. So...

Match Variable Spaces Instead of All Characters

To fix that, we might be explicit in saying that occurrences of ttt must be surrounded by one or more spaces:

StringReplace["eeeeee ttt ee", {" ".. ~~ "ttt" ~~ " ".. :> "abc", $patterns}]

(* "......abc.." *)

This prevents the runaway character matching that we saw when we used ___. All kinds of whitespace can be matched thus:

StringReplace["eeeeee ttt ee", {Whitespace ~~ "ttt" ~~ Whitespace :> "abc", $patterns}]

(* "......abc.." *)
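The two forms only differ when the separators are not plain spaces. As a sketch, here is a made-up input (the name text is my own) with a tab before ttt and a newline after it. The comparisons with === avoid relying on how tabs and newlines are displayed:

```wolfram
text = "eeeeee\tttt\nee";  (* tab before "ttt", newline after *)

(* " ".. only accepts spaces, so the rule never fires and each t becomes "-" *)
StringReplace[text, {" ".. ~~ "ttt" ~~ " ".. :> "abc", "e" -> ".", "t" -> "-"}] ===
  "......\t---\n.."
(* True *)

(* Whitespace also accepts the tab and the newline *)
StringReplace[text, {Whitespace ~~ "ttt" ~~ Whitespace :> "abc", "e" -> ".", "t" -> "-"}] ===
  "......abc.."
(* True *)
```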

Use Non-Letters As Separators

Another option would be to say that ttt must be surrounded by sequences of anything that is not a letter:

StringReplace["eeeeee!!ttt,ee"
 , { Except[LetterCharacter].. ~~ "ttt" ~~ Except[LetterCharacter].. :> "abc"
   , $patterns
   }
 ]

(* "......abc.." *)
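The flip side of requiring separators is that a ttt embedded directly between letters is no longer treated specially. A short check on a made-up input with no separators at all:

```wolfram
(* No non-letter characters surround "ttt", so only the Morse rules apply *)
StringReplace["eeetttee",
  {Except[LetterCharacter].. ~~ "ttt" ~~ Except[LetterCharacter].. :> "abc",
   "e" -> ".", "t" -> "-"}]

(* "...---.." *)
```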

Yet More Complex Patterns

As a parting thought, I will mention that we can invoke arbitrary functions as character pattern tests. For example, to match prime digits:

StringMatchQ["3", DigitCharacter?(PrimeQ@*ToExpression)]

(* True *)

StringMatchQ["4", DigitCharacter?(PrimeQ@*ToExpression)]

(* False *)

There are two caveats, however. First, the test will only be applied to a single character. We cannot test sequences of characters as a unit. Second, such tests involve calling back from the pattern-matching engine to the Mathematica evaluator. This slows down the matching process dramatically and might not be suitable when performance is critical.
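When the set of acceptable characters is small and known in advance, the callback can be avoided by listing the alternatives explicitly. This is a sketch of that idea for the prime-digit example, not a general replacement for pattern tests, and primeDigit is just an illustrative name:

```wolfram
(* Enumerate the prime digits instead of calling PrimeQ for every character *)
primeDigit = "2" | "3" | "5" | "7";

StringMatchQ["3", primeDigit]
(* True *)

StringMatchQ["4", primeDigit]
(* False *)

StringCases["0123456789", primeDigit]
(* {"2", "3", "5", "7"} *)
```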

All of the patterns mentioned in the response, and many more, are documented under the Details section for StringExpression.

Addendum - The Replacement Process

A simplified description of the replacement process is as follows. At any given point there is a current character position and a current rule, which start as the first character in the string and first replacement rule respectively. Then:

1. If the current character position has reached the end of the string, the process is complete.
2. The current rule attempts to match as many characters as possible starting from the current character position (see below about Shortest).
3. If the rule matches then:
- the replacement is performed,
- the first supplied rule becomes the current rule once again,
- the current character position is advanced to just after the match, and
- processing continues from step 1.
4. Otherwise, if there is another rule to try then:
- that next rule becomes the current rule, and
- processing continues from step 2.
5. Otherwise, there are no applicable rules at the current character position, so:
- the current character position is advanced by one,
- the first rule becomes the current rule, and
- processing continues from step 1.

In step 2, the use of Shortest will change the rule to match as few characters as possible while still maintaining a match. "Shortest" here means advancing the current position as little as possible. This means that characters might be trimmed from the end of the potential match, but never the beginning.
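The simplified process can be sketched as a toy function. This is not how StringReplace is actually implemented. It is a hypothetical helper (simulateReplace is my own name) that assumes literal string right-hand sides and ignores patterns that depend on the text to the left of the current position:

```wolfram
(* Hypothetical helper following the simplified steps; literal right-hand sides only *)
simulateReplace[s_String, rules_List] :=
  Module[{pos = 1, out = "", rest, m, matched},
    While[pos <= StringLength[s],          (* stop at the end of the string *)
      rest = StringDrop[s, pos - 1];
      matched = False;
      Do[
        (* try the current rule, anchored at the current position *)
        m = StringCases[rest, StartOfString ~~ First[r], 1];
        If[m =!= {} && StringLength[First[m]] > 0,
          (* on a match: emit the replacement, skip past the match, restart with the first rule *)
          out = out <> Last[r];
          pos += StringLength[First[m]];
          matched = True;
          Break[]
        ],
        {r, rules}                          (* otherwise move on to the next rule *)
      ];
      (* no rule applied: keep this character and advance by one *)
      If[! matched, out = out <> StringTake[s, {pos}]; pos++]
    ];
    out
  ]

simulateReplace["eeeeee ttt ee", {"e" -> ".", "t" -> "-", ___ ~~ "ttt" ~~ ___ :> "abc"}]
(* "......abc" *)

simulateReplace["eeeeee ttt ee", {"e" -> ".", "t" -> "-", __ ~~ "ttt" ~~ _ :> "abc"}]
(* "......abc.." *)
```

Both results agree with Case #1 and Case #2, which is the only sense in which the sketch is meant to be trusted.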

WReach covered this well, but as a supplement, consider using the third argument of StringReplace, which limits the number of replacements performed, to see how a replacement evolves:

$patterns = Sequence["e" -> ".", "t" -> "-"];

StringReplace[
  "eeeeee ttt ee",
  {$patterns, __ ~~ "ttt" ~~ _ :> "abc"},
  #
] & ~Array~ 10 // Column

.eeeee ttt ee
..eeee ttt ee
...eee ttt ee
....ee ttt ee
.....e ttt ee
...... ttt ee
......abcee
......abc.e
......abc..
......abc..
