Why does ls sorting ignore non-alphanumeric characters?

EDIT: Added test for data sorted with LC_COLLATE=C


The default collation sequence treats those "punctuation-type" characters as being of equal value. Use LC_COLLATE=C to sort them in codepoint order:

for i in 'a1' 'a_1' 'a-1' 'a,1' 'a.1' 'a2' 'a_2' 'a-2' 'a,2' 'a.2'; do
  echo "$i"
done | LC_COLLATE=C sort

Output:

a,1
a,2
a-1
a-2
a.1
a.2
a1
a2
a_1
a_2
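
Since the question is about ls, the same override can be tried on ls directly. This is only a minimal sketch: it assumes an ls that honours the locale's collation rules (GNU ls does) and uses a scratch directory from mktemp, with file names borrowed from the loop above. It sets LC_ALL rather than LC_COLLATE because LC_ALL, if it happens to be set in your environment, takes precedence over LC_COLLATE.

```shell
# Create a scratch directory with file names that differ only in punctuation.
tmpdir=$(mktemp -d)
touch "$tmpdir/a1" "$tmpdir/a_1" "$tmpdir/a-1" "$tmpdir/a,1" "$tmpdir/a.1"

# Force byte-value (codepoint) collation for this one ls invocation.
listing=$(LC_ALL=C ls -1 "$tmpdir")
printf '%s\n' "$listing"

# Clean up the scratch directory.
rm -r "$tmpdir"
```

The listing should come out in the same relative order as the sort output above: comma, hyphen, dot, digit, underscore.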

The following code tests all valid UTF-8 characters in the Basic Multilingual Plane (except \x00 and \x0a, for simplicity).
It generates a file in a known ascending sequence, shuffles that file with sort -R, and then sorts the shuffled file again with LC_COLLATE=C. The result shows that the C sequence is identical to the original generated sequence.

{ i=0 j=0 k=0 l=0
  for i in {0..9} {A..F} ;do
  for j in {0..9} {A..F} ;do
  for k in {0..9} {A..F} ;do
  for l in {0..9} {A..F} ;do
     (( 16#$i$j$k$l == 16#0000 )) && { printf '.' >&2; continue; }
     (( 16#$i$j$k$l == 16#000A )) && { printf '.' >&2; continue; }
     (( 16#$i$j$k$l >= 16#D800    && 
        16#$i$j$k$l <= 16#DFFF )) && { printf '.' >&2; continue; }
     (( 16#$i$j$k$l >= 16#FFFE )) && { printf '.' >&2; continue; }
     echo 0x"$i$j$k$l" |recode UTF-16BE/x4..UTF-8 || { echo "ERROR at codepoint $i$j$k$l " >&2; continue; } 
     echo 
  done
  done
  done; echo -n "$i$j$k$l " >&2
  done; echo >&2
} >listGen

             sort -R listGen    > listRandom
LC_COLLATE=C sort    listRandom > listCsort 

diff <(cat listGen;   echo "last line of listOrig " ) \
     <(cat listCsort; echo "last line of listCsort" )
echo 
cmp listGen listCsort; echo 'cmp $?='$?

Output:

63485c63485
< last line of listOrig 
---
> last line of listCsort

cmp $?=0

This has nothing to do with the charset. Rather, it's the language that determines the collation order. libc examines the locale named in $LC_ALL, $LC_COLLATE or $LANG (in that order of precedence), looks up its collation rules (e.g. /usr/share/i18n/locales/* for glibc), and orders the text as directed.
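
The precedence between those variables can be checked from the shell. A minimal sketch, assuming a POSIX-style sort; en_US.UTF-8 is only an illustrative value and does not need to be installed, because LC_ALL overrides it:

```shell
# LC_ALL takes precedence over LC_COLLATE, which takes precedence over LANG.
# With LC_ALL=C in effect, the LC_COLLATE value is not used for collation.
sorted=$(printf '%s\n' 'a_1' 'a1' 'a-1' | LC_ALL=C LC_COLLATE=en_US.UTF-8 sort)
printf '%s\n' "$sorted"
```

The result is in codepoint order (hyphen, digit, underscore), which is why setting LC_COLLATE alone appears to do nothing when LC_ALL is already exported in the environment.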


I am having exactly the same issue with Debian's default sort options. In my case it is the comma that is being ignored, which prevents me from sorting CSV data effectively and causes havoc in my AI.

The solution is that, instead of using sort on its own, I need to force sort out of its default behaviour, which seems to behave like -d, --dictionary-order.

Running the command:

sort -V

Fixes my problem and considers commas.
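
For CSV data specifically, another option that may be worth trying is to sort on an explicit comma-delimited key in the C locale. This is a sketch under assumptions: the sample rows are made up, and treating the first field as the key is just for illustration.

```shell
# Hypothetical CSV rows; the first field is used as the sort key.
rows=$(printf '%s\n' 'ab,2' 'a,b,1' 'a b,3')

# -t, sets the field separator, -k1,1 restricts the key to the first field,
# and LC_ALL=C makes the comparison bytewise so no character is skipped.
sorted=$(printf '%s\n' "$rows" | LC_ALL=C sort -t, -k1,1)
printf '%s\n' "$sorted"
```

Here the keys are "ab", "a" and "a b", so the rows come out as a,b,1 then a b,3 then ab,2: the comma acts purely as a separator and the space inside a field is compared by its byte value.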
