Highlighting pattern strings in a large text
There is another way that is on my machine almost 500x faster then your solution. The idea is to look how Mathematica represents colored strings and use this directly.
When we colorize an input string by selecting text and using the Format menu, we can create something like this
Now, press Ctrl+Shift+E to see the underlying expression.
Cell[BoxData["\"\<Hello
\!\(\*StyleBox[\"my\",FontColor->RGBColor[1, 0, 0]]\)
friend\>\""], "Input"]
I have put the important part in the second like and you see, it's only an inline style-box that is used.
In your updated question, you used a list of words to highlight and for this task, there is another approach useful:
- we create a function that takes a string and returns the same colorized string when it is in your list of words. Otherwise, it just returns the same string
- we split your input into words and apply this function to each word
- we rebuild all words into a string again which now contains normal text and highlighted words.
For this purpose, I use a Module
that on-the-fly creates local functions that do the highlighting. This is important, because with each call to highlight
you might want to provide a different list of words to highlight. Therefore, the function doHighlight
needs to be rebuilt on every call.
Sounds expensive? It is not and the implementation is only a few lines long:
highlight[txt_, words_] := Module[{colorize, doHighlight},
colorize[str_] := "\!\(\*StyleBox[\"" <> str <>
"\",FontColor->RGBColor[0, 0, 1]]\)";
SetAttributes[doHighlight, {Listable}];
(doHighlight[#] := colorize[#]) & /@ words;
doHighlight[s_] := s;
StringRiffle[doHighlight[StringSplit[txt]]]
]
Let's test it
Now let us time this with the same input that Peter Roberge used. His function needed 3.7 seconds on my machine.
txt = ExampleData[{"Text", "AeneidEnglish"}];
somewords = DictionaryLookup[RegularExpression["[A-Z][a-z]+"]];
output = highlight[txt, somewords]; // AbsoluteTiming
(* {0.168501, Null} *)
And the text is highlighted as expected
Since you were brave enough to read until the end, let me tell you that there is one significant drawback: Mathematica has a bug and does not export colored strings to rtf
correctly. At least on my machine, the text is not colorized in the final rtf
.
Update
In case you really need to replace not a fixed word, but an expression you need to use StringReplace
because it is possible you match more than one word (maybe a group of words). Therefore, splitting the text into words won't always work.
Nevertheless, the basic idea of my answer stays the same: We don't use Row
and Style
, but we inject inline string styles and transform a string into string.
The function itself becomes very easy:
highlight2[txt_, patterns_] :=
StringReplace[txt, str : (Alternatives @@ patterns) :>
"\!\(\*StyleBox[\"" <> str <> "\",FontColor->RGBColor[0, 0, 1]]\)"
]
Here a short test with different kinds of patterns:
highlight2["Hello bear, what are you doing here?",
{ "b" ~~ LetterCharacter ..,
_ ~~ "o" ~~ _,
RegularExpression["[A-Z][a-z]+"],
"re?"
}]
Update to provide custom style
Providing a custom style is possible too. You can just add this as parameter and the only thing you have to do inside the function is to transform this into a string and put it at the right place.
That being said:
highlight2[txt_, patterns_] := highlight2[txt, patterns, {Blue}];
highlight2[txt_, patterns_, {style__}] :=
StringReplace[txt,
str : (Alternatives @@ patterns) :>
"\!\(\*StyleBox[\"" <> str <> "\"," <>
StringRiffle[ToString /@ {style}, ", "]
<> "]\)"]
You can now give a list of style directives as last argument. When you leave them out, then the matching text becomes blue.
highlight2["Hello bear, what are you doing here?", {"b" ~~
LetterCharacter .., _ ~~ "o" ~~ _},
{30, Red, Italic}]
I really enjoy Mathematica when I can outsource tough algorithmic decisions to their source code- I believe this is the case here.
It appears as if your code is doing something expensive (searching and replacing) many different times.
I propose to do it all at once.
Benchmark:
txt = ExampleData[{"Text", "AeneidEnglish"}];
somewords = DictionaryLookup[RegularExpression["[A-Z][a-z]+"]];
AbsoluteTiming[txt /. Highlight[somewords, Style[#, Blue] &]][[1]]
82.7385
Set up:
txt = ExampleData[{"Text", "AeneidEnglish"}];
somewords = DictionaryLookup[RegularExpression["[A-Z][a-z]+"]];
Generate your rules:
rl = Flatten[# -> Style[#, Blue, Bold]] & /@ somewords;
Put rules on Virgil's Epic:
a = Row[{##}] & @@ StringReplace[txt , rl];
Second benchmark:
AbsoluteTiming[
rl = Flatten[# -> Style[#, Blue, Bold]] & /@ somewords;
a = Row[{##}] & @@ StringReplace[txt , rl];][[1]]
2.4377
Export:
Export["a.rtf", a]