Why is using a shell loop to process text considered bad practice?
Yes, we see a number of things like:
while read line; do
  echo $line | cut -c3
done
Or worse:
for line in `cat file`; do
  foo=`echo $line | awk '{print $2}'`
  echo whatever $foo
done
(don't laugh, I've seen many of those).
Generally from shell scripting beginners. Those are naive literal translations of what you would do in imperative languages like C or Python, but that's not how you do things in shells. Those examples are very inefficient, completely unreliable (potentially leading to security issues), and if you ever manage to fix most of the bugs, your code becomes illegible.
Conceptually
In C or most other languages, building blocks are just one level above computer instructions. You tell your processor what to do and then what to do next. You take your processor by the hand and micro-manage it: you open that file, you read that many bytes, you do this, you do that with it.
Shells are a higher-level language. One may even say they're not a language at all: they are, above all, command-line interpreters. The job is done by the commands you run, and the shell is only meant to orchestrate them.
One of the great things that Unix introduced was the pipe and those default stdin/stdout/stderr streams that all commands handle by default.
In 50 years, we've not found anything better than that API to harness the power of commands and have them cooperate on a task. That's probably the main reason people are still using shells today.
You've got a cutting tool and a transliteration tool, and you can simply do:
cut -c4-5 < in | tr a b > out
The shell is just doing the plumbing (opening the files, setting up the pipes, invoking the commands), and once everything is ready, the data just flows without the shell doing anything more. The tools do their job concurrently and efficiently, at their own pace, with enough buffering that neither blocks the other. It's beautiful and yet so simple.
Invoking a tool, though, has a cost (and we'll expand on that in the performance section). Those tools may be written with thousands of instructions in C. A process has to be created, the tool has to be loaded and initialised, and then cleaned up, with the process destroyed and waited for.
Invoking cut is like opening the kitchen drawer, taking the knife, using it, washing it, drying it, and putting it back in the drawer. When you do:
while read line; do
  echo $line | cut -c3
done < file
It's like, for each line of the file, getting the read tool from the kitchen drawer (a very clumsy one, because it's not been designed for that), reading a line, washing your read tool, and putting it back in the drawer. Then scheduling a meeting for the echo and cut tools, getting them from the drawer, invoking them, washing them, drying them, putting them back in the drawer, and so on.
Some of those tools (read and echo) are built into most shells, but that hardly makes a difference here, since echo and cut still need to be run in separate processes.
It's like cutting an onion, but washing your knife and putting it back in the kitchen drawer between each slice.
Here the obvious way is to get your cut tool from the drawer, slice your whole onion, and put it back in the drawer after the whole job is done.
IOW, in shells, especially for processing text, you invoke as few utilities as possible and have them cooperate on the task, rather than running thousands of tools in sequence, waiting for each one to start, run, and clean up before running the next one.
Further reading in Bruce's fine answer. The low-level text-processing internal tools in shells (except maybe for zsh) are limited, cumbersome, and generally not fit for general text processing.
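For comparison, here's a sketch of how the two loops at the top of this answer could each be written as a single invocation (assuming the intent of the second loop is to print "whatever" followed by the second field of every line, which is what its code suggests):
cut -c3 < file                        # replaces the whole while read ... cut loop
awk '{print "whatever", $2}' < file   # replaces the whole for ... awk ... echo loop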
Performance
As said earlier, running one command has a cost. A huge cost if that command is not builtin, but even if it is builtin, the cost is still significant.
And shells have not been designed to run like that; they have no pretension to being performant programming languages. They are not; they're just command-line interpreters, so little optimisation has been done on this front.
Also, shells run commands in separate processes. Those building blocks don't share a common memory or state. When you do an fgets() or fputs() in C, that's a function in stdio. stdio keeps internal input and output buffers for all the stdio functions, to avoid making costly system calls too often.
The corresponding shell utilities, even the builtin ones (read, echo, printf), can't do that. read is meant to read one line: if it read past the newline character, the next command you run would miss it. So read has to read the input one byte at a time (some implementations have an optimisation when the input is a regular file, in that they read chunks and seek back, but that only works for regular files, and bash for instance only reads 128-byte chunks, which is still a lot less than text utilities will read).
Same on the output side: echo can't just buffer its output; it has to output it straight away, because the next command you run will not share that buffer.
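One way to observe that byte-at-a-time behaviour, assuming a Linux system with strace installed (the sample text is arbitrary):
printf 'foo bar\n' |
  strace -e trace=read bash -c 'IFS= read -r line' 2>&1 |
  grep 'read(0'
# prints one read(0, ..., 1) = 1 line per byte of input, since the input
# is a pipe and bash's read builtin cannot seek back on it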
Obviously, running commands sequentially means you have to wait for them; it's a little scheduler dance that passes control from the shell to the tools and back. It also means (as opposed to using long-running instances of tools in a pipeline) that you cannot harness several processors at the same time when they are available.
Between that while read loop and the (supposedly) equivalent cut -c3 < file, in my quick test there's a CPU time ratio of around 40000 (one second versus half a day). But even if you use only shell builtins:
while read line; do
  echo ${line:2:1}
done
(here with bash), that's still around 1:600 (one second vs 10 minutes).
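You can reproduce that kind of comparison on a smaller scale with something like the following, run in bash (the file name, line count and content are arbitrary):
yes abcdefghij | head -n 100000 > /tmp/testfile   # build a 100000-line test file

time cut -c3 < /tmp/testfile > /dev/null          # one process for the whole file

time while IFS= read -r line; do                  # only builtins per line, and
  echo "${line:2:1}"                              # still hundreds of times slower
done < /tmp/testfile > /dev/null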
Reliability/legibility
It's very hard to get that code right. The examples I gave are seen too often in the wild, but they have many bugs.
read is a handy tool that can do many different things. It can read input from the user and split it into words to store in different variables. read line does not read a line of input, or maybe it reads a line in a very special way: it actually reads words from the input, those words being separated by $IFS, and where backslash can be used to escape the separators or the newline character.
With the default value of $IFS, on an input like:
foo\/bar \
baz
biz
read line will store "foo/bar baz" into $line, not "foo\/bar \" as you'd expect.
To read a line, you actually need:
IFS= read -r line
That's not very intuitive, but that's the way it is, remember shells were not meant to be used like that.
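A quick way to see the difference, using a here-document to create that exact sample input (the file name is an arbitrary choice):
cat > sample.txt <<'EOF'
foo\/bar \
baz
biz
EOF

read line < sample.txt
printf '%s\n' "$line"    # prints: foo/bar baz  (escapes processed, lines joined)

IFS= read -r line < sample.txt
printf '%s\n' "$line"    # prints: foo\/bar \   (the first line, taken literally)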
Same for echo. echo expands some sequences. You can't use it for arbitrary content, like the content of a random file. You need printf here instead.
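A small illustration; the exact echo behaviour varies between shells and options, which is precisely the problem:
line='foo\nbar'
echo "$line"             # prints foo\nbar in some shells, "foo", a newline and
                         # "bar" in others (bash's behaviour also changes with
                         # its options and environment)
printf '%s\n' "$line"    # always prints: foo\nbar

line=-n
echo "$line"             # many shells take -n as an option and print nothing
printf '%s\n' "$line"    # always prints: -n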
And of course, there's the typical failure to quote your variable, which everybody falls into. So it's more:
while IFS= read -r line; do
  printf '%s\n' "$line" | cut -c3
done < file
Now, a few more caveats:
- except for zsh, that doesn't work if the input contains NUL characters, while at least GNU text utilities would not have the problem.
- if there's data after the last newline, it will be skipped
- inside the loop, stdin is redirected so you need to pay attention that the commands in it don't read from stdin.
- for the commands within the loop, we're not paying attention to whether they succeed or not. Usually, error conditions (disk full, read errors...) will be poorly handled, more poorly than with the correct equivalent.
If we want to address some of those issues above, that becomes:
while IFS= read -r line <&3; do
  {
    printf '%s\n' "$line" | cut -c3 || exit
  } 3<&-
done 3< file
if [ -n "$line" ]; then
  printf '%s' "$line" | cut -c3 || exit
fi
That's becoming less and less legible.
There are a number of other issues with passing data to commands via the arguments or retrieving their output in variables:
- the limitation on the size of arguments (some text utility implementations have a limit there as well, though the effect of reaching those is generally less problematic)
- the NUL character (also a problem with text utilities)
- arguments taken as options when they start with - (or + sometimes)
- various quirks of the commands typically used in those loops, like expr, test...
- the (limited) text manipulation operators of various shells that handle multi-byte characters in inconsistent ways
- ...
Security considerations
When you start working with shell variables and arguments to commands, you're entering a mine-field.
If you forget to quote your variables, forget the end-of-options marker, or work in locales with multi-byte characters (the norm these days), you're certain to introduce bugs which sooner or later will become vulnerabilities.
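A contrived sketch of the kind of thing that goes wrong; the file names here are made up:
file='-f'                # a file name that happens to start with a dash
ls -l $file              # ls sees an option, not a file name
ls -l -- "$file"         # the "--" end-of-options marker makes it a file name again

file='my file.txt'       # a file name containing a space
cat $file                # unquoted: cat looks for "my" and "file.txt"
cat -- "$file"           # quoted (and with "--"): one file, as intended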
When you may want to use loops.
TBD
As far as concepts and legibility go, shells are typically interested in files. Their "addressable unit" is the file, and the "address" is the file name. Shells have all kinds of methods for testing file existence, file type, and file name formatting (beginning with globbing), but very few primitives for dealing with file contents. Shell programmers have to invoke another program to deal with file contents.
Because of the file and file name orientation, doing text manipulation in the shell is really slow, as you've noted, but also requires an unclear and contorted programming style.
There are some complicated answers, giving a lot of interesting details for the geeks among us, but it's really quite simple - processing a large file in a shell loop is just too slow.
I think the questioner is interested in a typical kind of shell script, which may start with some command-line parsing, environment setting, checking files and directories, and a bit more initialization, before getting on to its main job: going through a large, line-oriented text file.
For the first parts (initialization), it doesn't usually matter that shell commands are slow - it's only running a few dozen commands, maybe with a couple of short loops.
Even if we write that part inefficiently, it's usually going to take less than a second to do all that initialization, and that's fine - it only happens once.
But when we get on to processing the big file, which could have thousands or millions of lines, it is not fine for the shell script to take a significant fraction of a second (even if it's only a few dozen milliseconds) for each line, as that could add up to hours.
That's when we need to use other tools, and the beauty of Unix shell scripts is that they make it very easy for us to do that.
Instead of using a loop to look at each line, we need to pass the whole file through a pipeline of commands. This means that, instead of calling the commands thousands or millions of times, the shell calls them only once. It's true that those commands will have loops to process the file line by line, but they are not shell scripts and they are designed to be fast and efficient.
Unix has many wonderful built-in tools, ranging from the simple to the complex, that we can use to build our pipelines. I would usually start with the simple ones and only use more complex ones when necessary.
I would also try to stick with standard tools that are available on most systems, and try to keep my usage portable, although that's not always possible. And if your favourite language is Python or Ruby, maybe you won't mind the extra effort of making sure it's installed on every platform your software needs to run on :-)
Simple tools include head, tail, grep, sort, cut, tr, sed, join (when merging 2 files), and awk one-liners, among many others.
It's amazing what some people can do with pattern-matching and sed commands.
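For example, a typical pipeline built from a few of those tools; the file name and column choice here are made-up assumptions:
# the 10 most frequent values in the 2nd whitespace-separated column of a log file
awk '{print $2}' access.log | sort | uniq -c | sort -rn | head -n 10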
When it gets more complex, and you really have to apply some logic to each line, awk is a good option: either a one-liner (some people put whole awk scripts in 'one line', although that's not very readable) or a short external script.
As awk is an interpreted language (like your shell), it's amazing that it can do line-by-line processing so efficiently, but it's purpose-built for this and it's really very fast.
And then there's Perl and a huge number of other scripting languages that are very good at processing text files and also come with lots of useful libraries.
And finally, there's good old C, if you need maximum speed and high flexibility (although text processing is a bit tedious). But it's probably a very bad use of your time to write a new C program for every different file-processing task you come across. I work with CSV files a lot, so I have written several generic utilities in C that I can re-use in many different projects. In effect, this expands the range of 'simple, fast Unix tools' that I can call from my shell scripts, so I can handle most projects by only writing scripts, which is much faster than writing and debugging bespoke C code each time!
Some final hints:
- don't forget to start your main shell script with export LANG=C, or many tools will treat your plain-old-ASCII files as Unicode, making them much, much slower
- also consider setting export LC_ALL=C if you want sort to produce consistent ordering, regardless of the environment (a small sketch follows this list)
- if you need to sort your data, that will probably take more time (and resources: CPU, memory, disk) than everything else, so try to minimize the number of sort commands and the size of the files they're sorting
- a single pipeline, when possible, is usually most efficient - running multiple pipelines in sequence, with intermediate files, may be more readable and debuggable, but will increase the time that your program takes
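As promised above, a minimal sketch of those hints in action; data.csv and its layout are made-up assumptions:
export LANG=C LC_ALL=C        # byte-wise handling and collation: fast and consistent

# a single pipeline with one sort, reducing the data before sorting it
grep -v '^#' data.csv |       # drop comment lines (assumed input format)
  cut -d, -f1,3 |             # keep only the fields we need
  sort -t, -k1,1 > report.csv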