Why does sort say that ɛ = e?
No, it doesn't consider them equivalent; they just have the same primary weight, so to a first approximation they sort the same.
If you look at /usr/share/i18n/locales/iso14651_t1_common (used as the basis for most locales) on a GNU system (here with glibc 2.27), you'll see:
<U0065> <e>;<BAS>;<MIN>;IGNORE # 259 e
<U025B> <e>;<PCL>;<MIN>;IGNORE # 287 ɛ
<U0045> <e>;<BAS>;<CAP>;IGNORE # 577 E
e, ɛ and E have the same primary weight; e and E also have the same secondary weight, and only the third weight differentiates them.
When comparing strings, sort (which uses the standard libc function strcoll() to compare strings) starts by comparing the primary weights of all characters, and only goes on to the secondary weights if the strings are equal on the primary weights (and so on with the other weights).
That's how case seems to be ignored in the sorting order to a first approximation. Ab sorts between aa and ac, but Ab can sort before or after ab depending on the language rule (some languages have <MIN> before <CAP>, like British English; some have <CAP> before <MIN>, like Estonian).
If e had the same sorting order as ɛ, printf '%s\n' e ɛ | sort -u would return only one line. But as <BAS> sorts before <PCL>, e alone sorts before ɛ. eɛe sorts after EEE (at the secondary weight) even though EEE sorts after eee (for which we need to go up to the third weight).
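The level-by-level comparison can be imitated with plain byte-wise sorting on precomputed weight columns. This is only a sketch: the numeric weights are hypothetical stand-ins (1 for <BAS> and <MIN>, 2 for <PCL> and <CAP>, i.e. <MIN> before <CAP>), not the values glibc actually uses, and the function name multilevel_sort is made up for illustration.

```shell
# Each input line: primary secondary tertiary string
# Hypothetical per-character weights (not the real glibc values):
#   secondary: 1 = <BAS>, 2 = <PCL>;  tertiary: 1 = <MIN>, 2 = <CAP>
multilevel_sort() {
  # Compare the primary column first, then secondary, then tertiary,
  # all as plain bytes, and print only the original string.
  LC_ALL=C sort -t ' ' -k1,1 -k2,2 -k3,3 | cut -d ' ' -f4
}

sorted=$(
  printf '%s\n' \
    'eee 121 111 eɛe' \
    'eee 111 222 EEE' \
    'eee 111 111 eee' |
    multilevel_sort
)
printf '%s\n' "$sorted"
```

All three strings tie on the primary column, so eɛe is pushed last by its secondary weights, and eee and EEE are only separated at the tertiary column.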
Now if on my system with glibc 2.27, I run:
sed -n 's/\(.*;[^[:blank:]]*\).*/\1/p' /usr/share/i18n/locales/iso14651_t1_common |
sort -k2 | uniq -Df1
You'll notice that there are quite a few characters that have been defined with the exact same 4 weights. In particular, our ɛ has the same weights as:
<U01DD> <e>;<PCL>;<MIN>;IGNORE
<U0259> <e>;<PCL>;<MIN>;IGNORE
<U025B> <e>;<PCL>;<MIN>;IGNORE
And sure enough:
$ printf '%s\n' $'\u01DD' $'\u0259' $'\u025B' | sort -u
ǝ
$ expr ɛ = ǝ
1
That can be seen as a bug in the GNU libc locales. On most other systems, locales make sure all different characters end up with a different sorting order. With GNU locales, it gets even worse, as there are thousands of characters that don't have a sorting order and end up sorting the same, causing all sorts of problems (like breaking comm, join, ls or globs having non-deterministic orders...), hence the recommendation of using LC_ALL=C to work around those issues.
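As a quick check of that workaround, here is a minimal sketch that runs the three characters with identical weights through sort -u in the C locale, where comparison is done on byte values. It assumes the script is saved and run as UTF-8; the variable name distinct is just for illustration.

```shell
# The three characters that share identical weights in the glibc 2.27 table.
# In the C locale, sort compares raw bytes, so distinct characters never tie
# and sort -u keeps all of them.
distinct=$(( $(printf '%s\n' 'ǝ' 'ə' 'ɛ' | LC_ALL=C sort -u | wc -l) ))
echo "distinct lines in the C locale: $distinct"
```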
As noted by @ninjalj in comments, glibc 2.28, released in August 2018, came with some improvements on that front, though AFAICS there are still some characters or collating elements defined with identical sorting order. On Ubuntu 18.10 with glibc 2.28 and in an en_GB.UTF-8 locale:
$ expr $'L\ub7' = $'L\u387'
1
(why would U+00B7 be considered equivalent to U+0387 only when combined with L/l?!).
And:
$ perl -lC -e 'for($i=0; $i<0x110000; $i++) {$i = 0xe000 if $i == 0xd800; print chr($i)}' | sort > all-chars-sorted
$ uniq -d all-chars-sorted | wc -l
4
$ uniq -D all-chars-sorted | wc -l
1061355
(still over 1 million characters (95% of the Unicode range, down from 98% in 2.27) sorting the same as other characters as their sorting order is not defined).
See also:
- What does "LC_ALL=C" do?
- Generate the collating order of a string
- What is the difference between "sort -u" and "sort | uniq"?
man sort:
*** WARNING *** The locale specified by the environment affects sort
order. Set LC_ALL=C to get the traditional sort order that uses native
byte values.
So, try: LC_ALL=C sort file.txt
The character ɛ is not equal to e, but some locales place such characters close together when collating. The reasons are language-specific, and sometimes historical or even political. For example, most people would probably expect the €uro currency to come close to Europe in a dictionary.
Anyway, to see what collation you are currently using, run locale; locale -a will give you the list of locales available on the system; and to change the collation, say to C, for just one sorting run, use LC_COLLATE=C sort file. Finally, to see how different locales sort your file, try:
for loc in $(locale -a)
do echo ____"${loc}"____
LC_COLLATE="$loc" sort file
done
Pipe the result to some grepping tool to choose the locale that fits your needs.
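One caveat with the LC_COLLATE=C sort file form: if LC_ALL is set in the environment, it takes precedence over LC_COLLATE, so the per-command setting has no effect. A minimal sketch of a wrapper that clears LC_ALL for that one command follows; the name sort_c_collate is made up for illustration.

```shell
# Hypothetical helper: sort with C collation only, leaving the other
# locale categories (e.g. LC_CTYPE) as they are.
sort_c_collate() {
  # An empty LC_ALL is treated as unset, so LC_COLLATE takes effect.
  LC_ALL= LC_COLLATE=C sort "$@"
}

out=$(printf '%s\n' b B a | sort_c_collate)
printf '%s\n' "$out"
```

With C collation, uppercase ASCII letters sort before lowercase ones, whatever the surrounding locale is.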