Bash script; optimization of processing speed
The first rule of optimization is: don't optimize. Test first. If the tests show that your program is too slow, look for possible optimizations.
The only way to be sure is to benchmark for your use case. There are some general rules, but they only apply for typical volumes of data in typical applications.
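A minimal way to get that first measurement is the shell's `time` keyword (the script name and argument below are placeholders for your own):

# Time the whole script; "real" is wall-clock time,
# "user" and "sys" are CPU time spent in user space and in the kernel.
time ./myscript.sh input.txt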
Some general rules which may or may not be true in any particular circumstance:
- For internal processing in the shell, ATT ksh is fastest. If you do a lot of string manipulations, use ATT ksh. Dash comes second; bash, pdksh and zsh lag behind.
- If you need to invoke a shell frequently to perform a very short task each time, dash wins because of its low startup time.
- Starting an external process costs time, so it's faster to run one pipeline with complex pieces over all the data than to run a pipeline once per iteration of a loop.
- `echo $foo` is slower than `echo "$foo"`, because without the double quotes it splits `$foo` into words and interprets each word as a filename wildcard pattern. More importantly, that splitting and globbing behavior is rarely desired. So remember to always put double quotes around variable substitutions and command substitutions: `"$foo"`, `"$(foo)"` (see the short demonstration after this list).
- Dedicated tools tend to win over general-purpose tools. For example, tools like `cut` or `head` can be emulated with `sed`, but `sed` will be slower and `awk` will be even slower. Shell string processing is slow, but for short strings it largely beats calling an external program.
- More advanced languages such as Perl, Python, and Ruby often let you write faster algorithms, but they have a significantly higher startup time, so they're only worth it for performance with large amounts of data.
- On Linux at least, pipes tend to be faster than temporary files.
- Most uses of shell scripting are around I/O-bound processes, so CPU consumption doesn't matter much.
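As a quick illustration of the splitting and globbing point above (the variable contents and the `*.sh` pattern are purely illustrative):

foo='*.sh   two   spaces'
echo $foo     # unquoted: word-split, and *.sh is matched against files in the current directory
echo "$foo"   # quoted: prints the string exactly as stored, spacing included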
It's rare that performance is a concern in shell scripts. The list above is purely indicative; it's perfectly fine to use “slow” methods in most cases as the difference is often a fraction of a percent.
Usually the point of a shell script is to get something done fast. You have to gain a lot from optimization to justify spending extra minutes writing the script.
Shells do not do any reorganization of the code they are handed; it is simply interpreted one statement after the other (nothing else makes much sense in a command interpreter). Much of the time spent by the shell goes to lexical analysis, parsing, and launching the programs called.
For simple operations (like the string-munging examples at the end of the question), I'd be surprised if the time to load the programs didn't swamp any minuscule speed differences.
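A rough way to see how much of the time is program startup rather than actual work, assuming a Linux system with `/bin/echo` and `seq` available (the iteration count is arbitrary):

# Fork an external program 1000 times...
time for i in $(seq 1000); do /bin/echo hi; done >/dev/null
# ...then let the builtin produce the same output without any fork/exec.
time for i in $(seq 1000); do echo hi; done >/dev/null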
The moral of the story is that if you really need more speed, you are better off with a (semi-)compiled language like Perl or Python, which runs faster to begin with, in which you can write many of the operations mentioned directly without having to call out to external programs, and which retains the option to invoke external programs or call into optimized C (or whatever) modules to do much of the job. That is why in Fedora the "system administration sugar" (essentially the GUIs) is written in Python: you can add a nice GUI without too much effort, it is fast enough for such applications, and it has direct access to system calls. If that isn't enough speed, grab C++ or C.
But do not go there unless you can prove that the performance gain is worth the loss in flexibility and the extra development time. Shell scripts are not too bad to read, but I shudder when I remember some scripts used to install Ultrix that I once tried to decipher. I gave up; too much "shell script optimization" had been applied.
We'll expand here on the globbing example above to illustrate some performance characteristics of the shell script interpreter. Comparing the `bash` and `dash` interpreters for this example, where a process is spawned for each of 30,000 files, shows that dash can fork the `wc` processes nearly twice as fast as `bash`:
bash-4.2$ time dash -c 'for i in *; do wc -l "$i"; done>/dev/null'
real 0m1.238s
user 0m0.309s
sys 0m0.815s
bash-4.2$ time bash -c 'for i in *; do wc -l "$i"; done>/dev/null'
real 0m1.422s
user 0m0.349s
sys 0m0.940s
Comparing the base looping speed by not invoking the `wc` processes shows that dash's looping is nearly 6 times faster!

$ time bash -c 'for i in *; do echo "$i">/dev/null; done'
real 0m1.715s
user 0m1.459s
sys 0m0.252s
$ time dash -c 'for i in *; do echo "$i">/dev/null; done'
real 0m0.375s
user 0m0.169s
sys 0m0.203s
The looping is still relatively slow in either shell, as demonstrated previously, so for scalability we should try to use more functional techniques so that the iteration is performed in compiled processes.
$ time find -type f -print0 | wc -l --files0-from=- | tail -n1
30000 total
real 0m0.299s
user 0m0.072s
sys 0m0.221s
The above is by far the most efficient solution and illustrates the point well that one should do as little as possible in shell script, aiming just to use it to connect the existing logic available in the rich set of utilities on a UNIX system.
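Note that `--files0-from` is GNU-specific. On systems without it, a roughly equivalent portable sketch of the same idea, which prints only the grand total, is to let `find` and `cat` feed a single `wc` process:

# POSIX find batches the files into as few cat invocations as possible;
# wc then counts the combined stream in one compiled process.
find . -type f -exec cat {} + | wc -l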
Stolen from Common shell script mistakes by Pádraig Brady.