Specify the sort order with LC_COLLATE so lowercase is before uppercase

I don't know of any locales that, by default, sort in that order. The solution is to create a custom locale with a customized sort order. If anyone, four years later, wants to sort in a custom fashion, here's the trick.

The vast majority of locales don't specify their own sort order, but rather copy the sort order defined in /usr/share/i18n/locales/iso14651_t1_common, so that is the file you will want to edit. Rather than change the sort order for nearly every locale by modifying the original iso14651_t1_common, I suggest you make a copy and edit that. Details about how the sort order works and how to create a custom locale in your $HOME directory without root access are found in this answer to a similar question.

Take a look at how a and A are ordered based on their entries in iso14651_t1_common:

<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
<U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A

b and B are similar:

<U0062> <b>;<BAS>;<MIN>;IGNORE # 233 b
<U0042> <b>;<BAS>;<CAP>;IGNORE # 550 B

We see that on the first pass, both a and A have the collating symbol <a>, while both b and B have the collating symbol <b>. Since <a> appears before <b> in iso14651_t1_common, a and A tie with each other and sort before b and B, which are also tied. The second pass doesn't break the ties because all four characters have the collating symbol <BAS>, but during the third pass the ties are resolved because the collating symbol for lowercase letters <MIN> appears on line 3467, before the collating symbol for uppercase letters <CAP> (line 3488). So the sort order ends up as a, A, b, B.

Swapping the first and third collating symbols would sort letters first by case (lower then upper), then by accent (<BAS> means non-accented), then by alphabetical order. However, both <MIN> and <CAP> come before the numeric digits, so this would have the unwanted effect of putting digits after letters.

The easiest way to keep digits first while making all lowercase letters come before all uppercase letters is to force all letters to tie during the first comparison by setting them all equal to <a>. To make sure that they sort alphabetically within case, change the last collating symbol from IGNORE to the current first collating symbol. Following this pattern, a would become:

<U0061> <a>;<BAS>;<MIN>;<a> # 198 a

A would become:

<U0041> <a>;<BAS>;<CAP>;<a> # 517 A

b would become:

<U0062> <a>;<BAS>;<MIN>;<b> # 233 b

B would become:

<U0042> <a>;<BAS>;<CAP>;<b> # 550 B

and so on for the rest of the letters.
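
Editing every letter by hand is tedious. As a rough sketch, assuming the entries in your copy have exactly the layout shown above (one entry per line, <BAS> second, <MIN> or <CAP> third, IGNORE last), a sed substitution can apply the pattern to the unaccented letters. The function name rewrite_letters and the output file name iso14651_t1_custom are made up for illustration, and other glibc versions may lay the file out differently, so check the result before compiling it.

```shell
# Hypothetical helper: rewrite entries such as
#   <U0062> <b>;<BAS>;<MIN>;IGNORE # 233 b
# into
#   <U0062> <a>;<BAS>;<MIN>;<b> # 233 b
# Only unaccented (<BAS>) entries whose first symbol is a single
# letter <a>..<z> are changed; every other line passes through as-is.
rewrite_letters() {
    LC_ALL=C sed -E \
        's/^(<U[0-9A-F]{4}>[[:space:]]+)<([a-z])>;<BAS>;<(MIN|CAP)>;IGNORE/\1<a>;<BAS>;<\3>;<\2>/'
}

# Assumed usage (file names are illustrative):
#   rewrite_letters < iso14651_t1_common > iso14651_t1_custom
```

Accented letters are deliberately left alone here; they would need the same treatment if you want them grouped by case as well.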

Once you have created a customized version of iso14651_t1_common, follow the instructions in the answer linked above to compile your custom locale.


Setting LC_COLLATE=C is not always sufficient to sort uppercase before lowercase. You may need to set LC_ALL=C.

Sorting in the C locale also takes non-alphanumeric and even non-printable characters into account. If you don't want that, the -d and -i options (described in man sort) will turn it off.
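
A small illustration of the byte-order behaviour, and of what -d changes. This is only a sketch: the wrapper name c_sort and the sample input are made up for the demonstration.

```shell
# Hypothetical wrapper: sort stdin by raw byte value.
# LC_ALL takes precedence over LC_COLLATE and LANG.
c_sort() {
    LC_ALL=C sort "$@"
}

# Plain byte order: uppercase, then lowercase, then '~' (byte 126).
printf '%s\n' b A '~a' B | c_sort       # A, B, b, ~a

# With -d only blanks and alphanumerics count, so '~a' compares as 'a'.
printf '%s\n' b A '~a' B | c_sort -d    # A, B, ~a, b
```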

It will probably fail badly with multibyte input though, such as UTF-8 with non-ASCII characters.

To get lowercase (in order) before uppercase (in order), the best way I can think of that doesn't involve breaking out a full-fledged programming language is inverting the case of all the letters before the sort, and inverting them back afterwards.

tr 'a-zA-Z' 'A-Za-z' < file | LC_ALL=C sort | tr 'a-zA-Z' 'A-Za-z'
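
To see what the pipeline does, here is the same idea wrapped in a function and run on a few sample lines. The function name lower_first_sort is just for illustration, and like the original it only handles ASCII letters.

```shell
# Hypothetical wrapper around the pipeline above: swap the case of
# ASCII letters, sort by byte value, then swap the case back.
lower_first_sort() {
    tr 'a-zA-Z' 'A-Za-z' | LC_ALL=C sort | tr 'a-zA-Z' 'A-Za-z'
}

# Digits stay first, then lowercase in order, then uppercase in order.
printf '%s\n' Banana apple 1 Apple banana | lower_first_sort
# 1
# apple
# banana
# Apple
# Banana
```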

I'm no expert, but I have never seen a locale that defines collation like this. AFAIK collation like this exists only in the C locale, where it is based on ASCII values. (Normally I would just solve this with a script.)

I have never done this myself, but you might want to look at the localedef(1) and locale(5) man pages to understand how locales are defined, and eventually define your own.

Also, don't forget that if there are any diacritics or special characters, the C locale will not treat them as you might want. For example, it will not put á near a or Ł near L. In such cases, the language's native locale would probably be a better starting point.
