Optimize shell script for multiple sed replacements

You can use sed to produce correctly -formatted sed input:

sed -e 's/^/s|/; s/$/|g/' replacement_list | sed -r -f - file

I recently benchmarked various string replacement methods, among them a custom program, sed -e, perl -lnpe and an probably not that widely known MySQL command line utility, replace. replace being optimized for string replacements was almost an order of magnitude faster than sed. The results looked something like this (slowest first):

custom program > sed > LANG=C sed > perl > LANG=C perl > replace

If you want performance, use replace. To have it available on your system, you'll need to install some MySQL distribution, though.

From replace.c:

Replace strings in textfile

This program replaces strings in files or from stdin to stdout. It accepts a list of from-string/to-string pairs and replaces each occurrence of a from-string with the corresponding to-string. The first occurrence of a found string is matched. If there is more than one possibility for the string to replace, longer matches are preferred before shorter matches.

...

The programs make a DFA-state-machine of the strings and the speed isn't dependent on the count of replace-strings (only of the number of replaces). A line is assumed ending with \n or \0. There are no limit exept memory on length of strings.


More on sed. You can utilize multiple cores with sed, by splitting your replacements into #cpus groups and then pipe them through sed commands, something like this:

$ sed -e 's/A/B/g; ...' file.txt | \
  sed -e 's/B/C/g; ...' | \
  sed -e 's/C/D/g; ...' | \
  sed -e 's/D/E/g; ...' > out

Also, if you use sed or perl and your system has an UTF-8 setup, then it also boosts performance to place a LANG=C in front of the commands:

$ LANG=C sed ...

Tags:

Shell

Bash

Sed