Bracket expression (without ranges) matching unexpected character in bash
That's a consequence of those characters having the same sorting order.
You'll also notice that
sort -u << EOF
■
⅕
⅖
⅗
EOF
returns only one line.
Or that:
expr ■ = ⅕
returns true (as required by POSIX).
Most locales shipped with GNU systems have a number of characters (and even sequences of characters, i.e. collating sequences) that have the same sorting order. In the case of ■, ⅕, ⅖ and ⅗, it's because their order is not defined, and characters whose order is not defined end up having the same sorting order on GNU systems. There are also characters that are explicitly defined as having the same sorting order, like Ș and Ş (though there's no real logic or consistency, apparent to me anyway, in how it is done).
That is the source of quite surprising and bogus behaviours. I have raised the issue very recently on the Austin group (the body behind POSIX and the Single UNIX Specification) mailing list and the discussion is still ongoing as of 2015-04-03.
In this case, whether [y] should match x where x and y sort the same is unclear to me, but since a bracket expression is meant to match a collating element, that suggests that the bash behaviour is expected.
In any case, I suppose [⅕-⅕] or at least [⅕-⅖] should match ■.
You'll notice that different tools behave differently. ksh93 behaves like bash; GNU grep and sed don't. Some other shells behave differently again, and some, like yash, are even more buggy.
To have a consistent behaviour, you need a locale where all characters sort differently. The C locale is the typical one. However, the character set in the C locale on most systems is ASCII. On GNU systems, you generally have access to a C.UTF-8 locale that can be used instead to work on UTF-8 characters.
So:
(export LC_ALL=C.UTF-8; [[ ■ = [⅕⅖⅗] ]])
or the standard equivalent:
(export LC_ALL=C.UTF-8
case ■ in ([⅕⅖⅗]) true;; (*) false; esac)
should return false.
Another alternative would be to set only LC_COLLATE to C, which would work on GNU systems but not necessarily on others, where it could fail to specify the sorting order of multi-byte characters.
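Since C.UTF-8 is not available everywhere, a script may want to check for it before relying on it. Here is a minimal sketch, assuming that `locale -a` lists the installed locales and that the name is spelled C.UTF-8 or C.utf8; pick_locale is a hypothetical helper name, not a standard utility:

```sh
# pick_locale (hypothetical helper): print the name of a C.UTF-8 style
# locale if `locale -a` reports one, and fall back to plain C otherwise.
pick_locale() {
  # The spelling varies between systems (C.UTF-8, C.utf8), hence the
  # case-insensitive match with an optional hyphen.
  l=$(locale -a 2>/dev/null | grep -i -x 'C\.utf-\{0,1\}8' | head -n 1) || l=
  printf '%s\n' "${l:-C}"
}
```

It could then be used as (export LC_ALL="$(pick_locale)"; ...). Beware that when it falls back to C, non-ASCII characters are generally handled as sequences of bytes rather than as characters.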
One lesson of that is that equality is not as clear a notion as one would expect when it comes to comparing strings. Equality might mean, from strictest to least strict:
1. Same number of bytes and all byte constituents have the same value.
2. Same number of characters and all characters are the same (for instance, refer to the same codepoint in the current charset).
3. The two strings have the same sorting order as per the locale's collation algorithm (that is, neither a < b nor b < a is true).
Now, for 2 or 3, that assumes both strings contain valid characters. In UTF-8 and some other encodings, some sequences of bytes don't form valid characters.
1 and 2 are not necessarily equivalent because of that, or because some characters may have more than one possible encoding. That's typically the case of stateful encodings like ISO-2022-JP where A can be expressed as 41 or 1b 28 42 41 (1b 28 42 being the sequence to switch to ASCII; you can insert as many of those as you want, that won't make a difference), though I wouldn't expect those types of encoding to still be in use, and GNU tools at least generally don't work properly with them.
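To illustrate the point about byte sequences that don't form valid characters, here is a minimal sketch that asks iconv whether a string decodes as UTF-8. It assumes an iconv utility that knows the UTF-8 encoding name and that fails on undecodable input; is_valid_utf8 is a hypothetical helper name:

```sh
# is_valid_utf8 (hypothetical helper): succeed if the argument is made
# only of valid UTF-8 sequences, by asking iconv to decode it.
is_valid_utf8() {
  # iconv exits with a non-zero status when it meets a byte sequence
  # that cannot be decoded in the source encoding.
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}
```

For instance, is_valid_utf8 "$(printf '\377')" fails, as the byte 0xff never appears in valid UTF-8.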
Also beware that most non-GNU utilities can't deal with the 0 byte value (the NUL character in ASCII).
Which of those definitions is used depends on the utility and on its implementation or version. POSIX is not 100% clear on that. In the C locale, all 3 are equivalent. Outside of that, YMMV.
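To check which notion of equality applies to two given strings, the strictest and the least strict definitions above can be probed with expr. This is only a sketch under a few assumptions: that expr compares strings using the locale's collation (as POSIX specifies), that running it with LC_ALL=C amounts to a byte comparison, and that the strings contain no NUL byte. same_bytes and same_collation are hypothetical helper names; the character-level definition is left out.

```sh
# same_bytes (hypothetical helper): definition 1, compare in the C locale,
# where all three definitions are equivalent.
same_bytes() {
  # The x prefix stops expr from treating an operand as a number or operator.
  LC_ALL=C expr "x$1" = "x$2" >/dev/null
}

# same_collation (hypothetical helper): definition 3, neither string sorts
# before the other in the current locale.
same_collation() {
  ! expr "x$1" '<' "x$2" >/dev/null &&
    ! expr "x$2" '<' "x$1" >/dev/null
}
```

For instance, same_bytes ■ ⅕ returns false, while same_collation ■ ⅕ may well return true in a locale where those two characters have the same sorting order.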