Does the Bash star * wildcard always produce an (ascending) sorted list?
In all shells, globs are sorted by default. They already were in the /etc/glob
helper that Ken Thompson's shell called to expand globs in the first version of Unix in the early 70s (and which gave globs their name).
For sh, POSIX does require them to be sorted by way of strcoll(), that is, using the sorting order in the user's locale, like for ls, though some still do it via strcmp(), that is, based on byte values only.
$ dash -c 'echo *'
Log01B log-0D log00 log01 log02 log0A log0B log0C log4E log4F log50 log① log② lóg01
$ bash -c 'echo *'
log① log② log00 log01 lóg01 Log01B log02 log0A log0B log0C log-0D log4E log4F log50
$ zsh -c 'echo *'
log① log② log00 log01 lóg01 Log01B log02 log0A log0B log0C log-0D log4E log4F log50
$ ls
log② log① log00 log01 lóg01 Log01B log02 log0A log0B log0C log-0D log4E log4F log50
$ ls | sort
log②
log①
log00
log01
lóg01
Log01B
log02
log0A
log0B
log0C
log-0D
log4E
log4F
log50
You may notice above that for those shells that sort based on the locale, here on a GNU system with an en_GB.UTF-8 locale, the - in the file names is ignored for sorting (as most punctuation characters would be). The ó is sorted in a more expected way (at least to British people), and case is ignored (except when it comes to deciding ties).
However, you'll notice some inconsistencies for log① log②. That's because the sorting order of ① and ② is not defined in GNU locales (currently; hopefully it will be fixed some day). They sort the same, so you get random results.
Changing the locale will affect the sorting order. You can set the locale to C to get a strcmp()-like sort:
$ bash -c 'echo *'
log① log② log00 log01 lóg01 Log01B log02 log0.2 log0A log0B log0C log-0D log4E log4F log50
$ bash -c 'LC_ALL=C; echo *'
Log01B log-0D log0.2 log00 log01 log02 log0A log0B log0C log4E log4F log50 log① log② lóg01
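To reproduce the byte-order result without touching your own files, here is a minimal sketch. The scratch directory and the file names are made up for illustration, and it assumes mktemp -d is available (it is not strictly POSIX, though it is widely supported):

```shell
# Create a scratch directory with a few ASCII-only names (hypothetical names).
demo_dir=$(mktemp -d)
touch "$demo_dir/log01" "$demo_dir/Log01B" "$demo_dir/log-0D" "$demo_dir/log0A"

# Expand the glob in a child shell whose locale is forced to C, so names are
# compared byte by byte: 'L' (0x4C) sorts before 'l' (0x6C), and
# '-' (0x2D) sorts before '0' (0x30).
c_sorted=$(cd "$demo_dir" && LC_ALL=C sh -c 'echo *')
echo "$c_sorted"

# Clean up the scratch directory.
rm -rf "$demo_dir"
```

The variable is set for the child sh only, so the locale of your interactive shell is left alone.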
Note that some locales can cause confusion even for all-ASCII, all-alphanumeric strings. One example is Czech locales (on GNU systems at least), where ch is a collating element that sorts after h:
$ LC_ALL=cs_CZ.UTF-8 bash -c 'echo *'
log0Ah log0Bh log0Dh log0Ch
Or, as pointed out by @ninjalj, even weirder ones in Hungarian locales:
$ LC_ALL=hu_HU.UTF-8 bash -c 'echo *'
logX LOGx LOGX logZ LOGz LOGZ logY LOGY LOGy
In zsh, you can choose the sorting with glob qualifiers. For instance:
echo *(om) # to sort by modification time
echo *(oL) # to sort by size
echo *(On) # for a *reverse* sort by name
echo *(o+myfunction) # sort using a user-defined function
echo *(oN) # to NOT sort
echo *(n) # sort by name, but numerically, and so on.
The numeric sort of echo *(n) can also be enabled globally with the numericglobsort option:
$ zsh -c 'echo *'
log① log② log00 log01 lóg01 Log01B log02 log0.2 log0A log0B log0C log-0D log4E log4F log50
$ zsh -o numericglobsort -c 'echo *'
log① log② log00 lóg01 Log01B log0.2 log0A log0B log0C log01 log02 log-0D log4E log4F log50
If you (as I was) are confused by that order in that particular instance (here using my British locale), see here for details.
The man page for bash does specify:
Pathname Expansion
After word splitting, unless the -f option has been set, bash scans each word for the characters *, ?, and [. If one of these characters appears, then the word is regarded as a pattern, and replaced with an alphabetically sorted list of filenames matching the pattern […].
Unless you trigger some very specific shell options in some shells, the order of the output is guaranteed to be the same every time.
The order is specified in the POSIX standard:
If the pattern matches any existing filenames or pathnames, the pattern shall be replaced with those filenames and pathnames, sorted according to the collating sequence in effect in the current locale. If this collating sequence does not have a total ordering of all characters (see XBD LC_COLLATE), any filenames or pathnames that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.
See also LC_COLLATE Category in the POSIX Locale, which in short says that if LC_COLLATE=C, then things are ordered in ASCII order.
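If you only want to change the collation order and keep the rest of your locale, setting LC_COLLATE alone is enough, provided LC_ALL is not set (LC_ALL takes precedence over the individual LC_* categories). A small sketch with made-up file names, again assuming mktemp -d is available:

```shell
# Scratch directory with hypothetical file names.
collate_dir=$(mktemp -d)
touch "$collate_dir/item-2" "$collate_dir/item1" "$collate_dir/Item3"

# The command substitution runs in a subshell, so the caller's
# environment is left untouched.
collate_sorted=$(
  cd "$collate_dir" || exit 1
  unset LC_ALL          # LC_ALL, if set, would override LC_COLLATE
  LC_COLLATE=C
  export LC_COLLATE
  sh -c 'echo *'        # the child shell sorts the glob in byte order
)
echo "$collate_sorted"

# Clean up the scratch directory.
rm -rf "$collate_dir"
```

Here the uppercase name comes first and item-2 sorts before item1, because - has a lower byte value than 1.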
The bash manual mentions:
LC_COLLATE
This variable determines the collation order used when sorting the results of pathname expansion, and determines the behavior of range expressions, equivalence classes, and collating sequences within pathname expansion and pattern matching.
The ksh93 and zsh manuals have similar wording, which leads me to believe that they follow the POSIX standard in this regard.
The manuals of other shells, like pdksh and dash, do not say anything about the sorting of the filenames resulting from filename globbing. I'm tempted to believe that this means that they still adhere to the same standard, at least when using the POSIX locale. In my experience, I have not come across a shell that does any overtly "strange" sorting of ASCII filenames.
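One way to check a given sh yourself is to compare its glob order in the C locale against sort run in the same locale. This is only a sketch: the file names are invented for the test, and it assumes mktemp -d is available.

```shell
# Scratch directory with hypothetical ASCII names mixing case and punctuation.
check_dir=$(mktemp -d)
touch "$check_dir/c" "$check_dir/B" "$check_dir/a-1" "$check_dir/a1" "$check_dir/_z"

# Order produced by the shell's own glob expansion in the C locale.
glob_order=$(cd "$check_dir" && LC_ALL=C sh -c 'printf "%s\n" *')

# The same names re-sorted by sort(1) in the C locale, i.e. by byte value.
sort_order=$(printf '%s\n' "$glob_order" | LC_ALL=C sort)

if [ "$glob_order" = "$sort_order" ]; then
  echo "glob order matches byte order"
else
  echo "glob order differs from byte order"
fi

# Clean up the scratch directory.
rm -rf "$check_dir"
```

Replace sh with the shell you want to test (dash, pdksh, and so on) to run the same check against it.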