What is word splitting? Why is it important in shell programming?
Early shells had only a single data type: strings. But it is common to manipulate lists of strings, typically when passing multiple file names as arguments to a program. Another common use case for splitting is when a command outputs a list of results: the command's output is a string, but the desired data is a list of strings. To store a list of file names in a variable, you would put spaces between them. Then a shell script like this
files="foo bar qux"
myprogram $files
called myprogram with three arguments, as the shell split the string $files into words. At the time, spaces in file names were either forbidden or widely considered Not Done.
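To see the splitting happen, here is a minimal sketch for a Bourne-style shell (sh, bash, ksh; not zsh with its default options). count_args is a hypothetical helper introduced only for this illustration: it prints how many arguments it received.

```shell
# count_args is a hypothetical helper: it prints the number of arguments it got.
count_args() { echo "$#"; }

files="foo bar qux"

# Unquoted: the shell splits the value on the characters in IFS
# (space, tab, newline by default), so three arguments are passed.
unquoted=$(count_args $files)

# Quoted: no splitting, the whole string is passed as one argument.
quoted=$(count_args "$files")

echo "unquoted: $unquoted, quoted: $quoted"
```

In a Bourne-style shell this prints "unquoted: 3, quoted: 1".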
The Korn shell introduced arrays: you could store a list of strings in a variable. The Korn shell remained compatible with the then-established Bourne shell, so bare variable expansions kept undergoing word splitting, and using arrays required some syntactic overhead. With arrays, you would write the snippet above as:
files=(foo bar qux)
myprogram "${files[@]}"
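The benefit of the array form shows up once an element contains a space. The following is a sketch assuming bash or ksh93 (plain POSIX sh has no arrays); count_args and the file name "my file.txt" are made up for the illustration.

```shell
# count_args is a hypothetical helper: it prints the number of arguments it got.
count_args() { echo "$#"; }

# An array keeps each element intact, even one with an embedded space.
files=("my file.txt" bar qux)

# Quoted [@] expansion: one argument per array element.
as_array=$(count_args "${files[@]}")

# Unquoted expansion: each element is word-split again,
# so "my file.txt" turns into two arguments.
unquoted=$(count_args ${files[@]})

echo "quoted array: $as_array, unquoted: $unquoted"
```

This prints "quoted array: 3, unquoted: 4", which is why the quotes around "${files[@]}" are part of the idiom rather than optional.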
Zsh had arrays from the start, and its author opted for a saner language design at the expense of backward compatibility. In zsh (under the default expansion rules), $var does not perform word splitting; if you want to store a list of words in a variable, you are meant to use an array; and if you really want word splitting, you can write $=var.
files=(foo bar qux)
myprogram $files
These days, spaces in file names are something you need to cope with, both because many users expect them to work and because many scripts are executed in security-sensitive contexts where an attacker may be in control of file names. So automatic word splitting is often a nuisance; hence my general advice to always use double quotes, i.e. write "$foo", unless you understand why you need word splitting in a particular use case. (Note that bare variable expansions undergo globbing as well.)
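The globbing remark is easy to overlook, so here is a small sketch of it for a Bourne-style shell. It assumes mktemp -d is available and returns a path without spaces or wildcard characters; count_args and the file names a.txt and b.txt are hypothetical.

```shell
# count_args is a hypothetical helper: it prints the number of arguments it got.
count_args() { echo "$#"; }

# Work in a scratch directory so the result does not depend on existing files.
tmpdir=$(mktemp -d)
touch "$tmpdir/a.txt" "$tmpdir/b.txt"

pattern="$tmpdir/*.txt"

# Unquoted: after splitting, the result is treated as a glob pattern
# and replaced by the matching file names.
globbed=$(count_args $pattern)

# Quoted: the pattern is passed through unchanged as a single argument.
literal=$(count_args "$pattern")

echo "unquoted: $globbed, quoted: $literal"
rm -r "$tmpdir"
```

This prints "unquoted: 2, quoted: 1": a value that merely contains a * can turn into a list of file names when left unquoted.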
In this specific case of Zsh, word splitting is defined slightly differently than field splitting.
Consider prog a b c: it will pass in three arguments no matter how you set IFS. This is word splitting.
If you do A="a b c"; prog $A, it will pass in three arguments if IFS includes a space, or one argument otherwise. This is field splitting.
Definitions here are subtle. What the Zsh document is trying to say is that, even if you disable that option, prog a b c will still get separate arguments (which is what people always expect).
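The two cases can be compared side by side in a Bourne-style shell (sh, bash, ksh). This is only a sketch: count_args is a hypothetical stand-in for prog, and IFS is changed inside a command substitution so the change does not leak into the rest of the script.

```shell
# count_args is a hypothetical stand-in for prog: it prints its argument count.
count_args() { echo "$#"; }

A="a b c"

# Words typed on the command line are separated by the parser,
# whatever IFS is set to.
literal_words=$(IFS=:; count_args a b c)

# The result of an expansion is split on IFS: default IFS contains a space...
default_ifs=$(count_args $A)

# ...but with IFS set to ":" the value contains no separator and stays whole.
colon_ifs=$(IFS=:; count_args $A)

echo "literal: $literal_words, default IFS: $default_ifs, IFS=':': $colon_ifs"
```

This prints "literal: 3, default IFS: 3, IFS=':': 1".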
Word splitting is not really shell specific.
Most programs that need to parse text input use some form of word splitting as a first step. It is done before classifying these "words" as numbers, operators, strings, tokens, or whatever similar entities the program needs to process.
What is specific to shells is that they have to properly build the argument list of the commands they call (argc/argv in C, sys.argv in Python), including arguments with embedded spaces, empty arguments, custom delimiters, and so on. Many shells use the IFS variable to allow some flexibility there.
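As a rough illustration of those cases in a Bourne-style shell, the sketch below builds argument lists with an embedded space, an empty argument, and a custom delimiter. count_args and the sample values are assumptions made up for the example.

```shell
# count_args is a hypothetical helper: it prints the number of arguments it got.
count_args() { echo "$#"; }

# Quoting preserves an embedded space and lets an empty string
# count as an argument of its own.
with_empty=$(count_args "two words" "" last)

# A custom delimiter: split a colon-separated list by setting IFS
# inside the command substitution only.
path_like="/usr/bin:/bin:/usr/local/bin"
split_on_colon=$(IFS=:; count_args $path_like)

echo "with empty: $with_empty, colon-separated: $split_on_colon"
```

This prints "with empty: 3, colon-separated: 3".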