Bash script; optimization of processing speed
The first rule of optimization is: don't optimize. Test first. If the tests show that your program is too slow, look for possible optimizations.
The only way to be sure is to benchmark for your use case. There are some general rules, but they only apply for typical volumes of data in typical applications.
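A minimal way to get that first measurement is the shell's `time` keyword (the script name and argument below are placeholders for your own):

# Time the whole script; "real" is wall-clock time,
# "user" and "sys" are CPU time spent in user space and in the kernel.
time ./myscript.sh input.txt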
Some general rules which may or may not be true in any particular circumstance:
- For internal processing in the shell, ATT ksh is fastest. If you do a lot of string manipulations, use ATT ksh. Dash comes second; bash, pdksh and zsh lag behind.
- If you need to invoke a shell frequently to perform a very short task each time, dash wins because of its low startup time.
- Starting an external process costs time, so it's faster to run one pipeline with complex pieces over all the data than to run a pipeline once per iteration of a loop.
- `echo $foo` is slower than `echo "$foo"`, because without the double quotes it splits `$foo` into words and interprets each word as a filename wildcard pattern. More importantly, that splitting and globbing behavior is rarely desired. So remember to always put double quotes around variable substitutions and command substitutions: `"$foo"`, `"$(foo)"` (see the short demonstration after this list).
- Dedicated tools tend to win over general-purpose tools. For example, tools like `cut` or `head` can be emulated with `sed`, but `sed` will be slower and `awk` will be even slower. Shell string processing is slow, but for short strings it largely beats calling an external program.
- More advanced languages such as Perl, Python, and Ruby often let you write faster algorithms, but they have a significantly higher startup time, so they're only worth it for performance with large amounts of data.
- On Linux at least, pipes tend to be faster than temporary files.
- Most uses of shell scripting are around I/O-bound processes, so CPU consumption doesn't matter much.
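As a quick illustration of the splitting and globbing point above (the variable contents and the `*.sh` pattern are purely illustrative):

foo='*.sh   two   spaces'
echo $foo     # unquoted: word-split, and *.sh is matched against files in the current directory
echo "$foo"   # quoted: prints the string exactly as stored, spacing included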
It's rare that performance is a concern in shell scripts. The list above is purely indicative; it's perfectly fine to use “slow” methods in most cases as the difference is often a fraction of a percent.
Usually the point of a shell script is to get something done fast. You have to gain a lot from optimization to justify spending extra minutes writing the script.
Shells do not do any reorganization of the code they are handed; it is simply interpreted one statement after the other (nothing else makes much sense in a command interpreter). Much of the time spent by the shell goes to lexical analysis, parsing, and launching the programs called.
For simple operations (like the string-munging examples at the end of the question), I'd be surprised if the time to load the programs didn't swamp any minuscule speed differences.
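A rough way to see how much of the time is program startup rather than actual work, assuming a Linux system with `/bin/echo` and `seq` available (the iteration count is arbitrary):

# Fork an external program 1000 times...
time for i in $(seq 1000); do /bin/echo hi; done >/dev/null
# ...then let the builtin produce the same output without any fork/exec.
time for i in $(seq 1000); do echo hi; done >/dev/null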
The moral of the story is that if you really need more speed, you are better off with a (semi-)compiled language like Perl or Python, which runs faster to begin with, in which you can write many of the operations mentioned directly without having to call out to external programs, and which retains the option to invoke external programs or call into optimized C (or whatever) modules to do much of the job. That is why in Fedora the "system administration sugar" (essentially the GUIs) is written in Python: you can add a nice GUI without too much effort, it is fast enough for such applications, and it has direct access to system calls. If that isn't enough speed, grab C++ or C.
But do not go there unless you can prove that the performance gain is worth the loss in flexibility and the extra development time. Shell scripts are not too bad to read, but I shudder when I remember some scripts used to install Ultrix that I once tried to decipher. I gave up; too much "shell script optimization" had been applied.
We'll expand here on the globbing example above to illustrate some performance characteristics of the shell script interpreter. Comparing the `bash` and `dash` interpreters for this example, where a process is spawned for each of 30,000 files, shows that dash can fork the `wc` processes nearly twice as fast as `bash`:
bash-4.2$ time dash -c 'for i in *; do wc -l "$i"; done>/dev/null'
real 0m1.238s
user 0m0.309s
sys 0m0.815s
bash-4.2$ time bash -c 'for i in *; do wc -l "$i"; done>/dev/null'
real 0m1.422s
user 0m0.349s
sys 0m0.940s
Comparing the base looping speed by not invoking the `wc` processes shows that dash's looping is nearly 6 times faster!

$ time bash -c 'for i in *; do echo "$i">/dev/null; done'
real 0m1.715s
user 0m1.459s
sys 0m0.252s
$ time dash -c 'for i in *; do echo "$i">/dev/null; done'
real 0m0.375s
user 0m0.169s
sys 0m0.203s
The looping is still relatively slow in either shell, as demonstrated previously, so for scalability we should try to use more functional techniques so that the iteration is performed in compiled processes.
$ time find -type f -print0 | wc -l --files0-from=- | tail -n1
30000 total
real 0m0.299s
user 0m0.072s
sys 0m0.221s
The above is by far the most efficient solution and illustrates the point well that one should do as little as possible in shell script, aiming just to use it to connect the existing logic available in the rich set of utilities on a UNIX system.
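Note that `--files0-from` is GNU-specific. On systems without it, a roughly equivalent portable sketch of the same idea, which prints only the grand total, is to let `find` and `cat` feed a single `wc` process:

# POSIX find batches the files into as few cat invocations as possible;
# wc then counts the combined stream in one compiled process.
find . -type f -exec cat {} + | wc -l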
Stolen from Common shell script mistakes by Pádraig Brady.