How do I differentiate between uppercase and lowercase characters in a case statement?
A simple answer, one which no doubt others can supersede.
The character set ordering is now different depending on which locale is in use. The concept of locale was introduced to support different nationalities and their different languages. As you can see from the output of the locale command, there are several different areas now addressed, not just collation.
In your case it's US, and for sorting and collation purposes the alphabet is either AaBbCc...Zz or A=a, B=b, C=c, etc. (I forget which, and I'm not at a computer where I can verify one over the other.) Locales are very complicated, and in certain locales there can be characters that are invisible as far as sorting and collation are concerned. The same character can sort differently depending on which locale is in use.
As you've found, the correct way to identify lowercase characters is with [[:lower:]]; this will include accented characters where necessary, and even lowercase characters in different alphabets (Greek, Cyrillic, etc.).
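As a minimal sketch of that approach, the case statement below tests the first character against the POSIX character classes rather than a range. The function name classify_first is hypothetical and only exists for this illustration; how non-ASCII letters are classified still depends on the shell and the active locale.

```shell
#!/bin/sh
# classify_first is a hypothetical helper name used only for this sketch.
# It reports which character class the first character of $1 belongs to.
classify_first() {
    case "$1" in
        [[:upper:]]*) echo "uppercase" ;;
        [[:lower:]]*) echo "lowercase" ;;
        *)            echo "other" ;;
    esac
}

classify_first "Book"
classify_first "book"
classify_first "42"
```

For plain ASCII input the classification does not depend on the locale, which is what makes the classes easier to reason about than a-z or A-Z ranges.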
If you want the classic ordering you can revert per application or even per command by setting LC_ALL=C. For a contrived example,
grep some_pattern | LC_ALL=C sort | nl
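To see the per-command effect in isolation, here is a small sketch that sorts four letters with the locale overridden for the sort command only. The function name sort_c is hypothetical; the rest of the shell session keeps whatever locale it already had.

```shell
#!/bin/sh
# sort_c is a hypothetical helper name for this sketch.
# LC_ALL=C applies to this one sort invocation only.
sort_c() {
    LC_ALL=C sort
}

# In the C locale, sorting is by byte value, so uppercase comes first.
printf '%s\n' b A a B | sort_c
# prints A, B, a, b (one per line)
```

Running the same pipeline with plain sort in a locale such as en_US.UTF-8 will typically interleave the cases instead.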
There has been an enduring battle between dictionary order and ASCII order.
For a long time.
From the point of view of Unicode, characters should be sorted by local customs in their dictionary order, thus a A b B ... for American letters (ASCII letters). That is usually matched by the [a-zA-Z] range in the en_US.UTF-8 locale. Internationalization efforts usually agree with this.
From the point of view of programmers, due to the C language, [a-z] should match only the ASCII characters from 97 up to 122, each as a one-byte value. Similarly for [A-Z]. That usually matches the C language definition of a character as one byte. Some script writers want to use this definition.
That battle has moved from one interpretation to the other from time to time.
Sometimes the [a-z] range becomes only abcdefghijklmnopqrstuvwxyz.
Sometimes it shifts to aAbBcCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYz.
Or to some other, considerably more complex list.
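One way to find out which interpretation your own shell uses is to probe the range directly. The sketch below, with the hypothetical function name probe_az, checks a handful of letters against [a-z] and reports the ones that matched. The first call reflects the current shell and locale, so its result is not predictable here; the second forces the C locale in a subshell.

```shell
#!/bin/sh
# probe_az is a hypothetical helper name for this sketch.
# It reports which of a few sample letters the [a-z] range matches.
probe_az() {
    matched=''
    for ch in a A b B z Z; do
        case "$ch" in
            [a-z]) matched="$matched$ch" ;;
        esac
    done
    printf '[a-z] matched: %s\n' "$matched"
}

probe_az                               # depends on the shell and current locale
( LC_ALL=C; export LC_ALL; probe_az )  # C locale: only a, b and z match
```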
The details are complex. The history is long. The battle is still raging.
So, you may get (testing the string book):
- "your string begins with a Capital Letter" for bash versions 2, 3 and 4 and
- "your string begins with a lowercase letter" for bash version 5 (and 1)
- Most shells will report that as a "lowercase letter".
If you test the string úber (in the en_US.UTF-8 locale), you will get:
- "lowercase" in ksh/ATT-sh
- "Not an English Letter" in dash, zsh, bash 5.0+ or [lm]ksh.
- "Capital Letter" in bash 2, 3, and 4.
The same applies to the string Úber.
So, the result is varied.
You could also set LC_ALL=C to enforce the interpretation that a-z are only lowercase letters (and A-Z are only uppercase letters). That will freeze the collation used to only the one from C. Nothing changes if the locale changes. That makes for a more robust script, but a less adaptable one.
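A sketch of that option, assuming you want the C collation for a single test rather than for the whole script: the hypothetical function classify_c runs its body in a subshell (note the parentheses instead of braces), so the LC_ALL override does not leak into the caller.

```shell
#!/bin/sh
# classify_c is a hypothetical helper name for this sketch.
# The ( ... ) body is a subshell, so LC_ALL=C stays local to the function.
classify_c() (
    LC_ALL=C
    export LC_ALL
    case "$1" in
        [A-Z]*) echo "Capital Letter" ;;
        [a-z]*) echo "lowercase letter" ;;
        *)      echo "Not an English Letter" ;;
    esac
)

classify_c "book"   # lowercase letter
classify_c "Book"   # Capital Letter
classify_c "42"     # Not an English Letter
```

In the C locale the ranges compare byte values, so a string starting with a multibyte character such as ú should fall through to the last branch.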
There is also the option to use [[:lower:]] but, again, that is guaranteed to be the ASCII range a-z only in the C locale. It may get enforced for all locales in a future version of POSIX (but that had not been published as of 2020).
All considered, the only safe way to ensure that no external decision (from a shell developer or a Unix specification) will change the range of your code is:
# practicing case statements
echo "enter a string"
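# -r keeps backslashes in the input literal
read -r yourstring
# printf instead of echo -e, which not every shell supports
printf 'your string is %s\n\n' "$yourstring"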
low='abcdefghijklmnopqrstuvwxyz'
cap='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
case "$yourstring" in
[$cap]* ) echo "your string begins with a Capital Letter" ;;
[$low]* ) echo "your string begins with a lowercase letter" ;;
*) echo "your string did not begin with an English letter" ;;
esac