Command to retrieve the list of characters in a given character class in the current locale
POSSIBLE FINAL SOLUTION
So I've taken all of the below information and come up with this:
for class in $(
    locale -v LC_CTYPE |
    sed 's/combin.*//;s/;/\n/g;q'
) ; do
    printf "\n\t%s\n\n" "$class"
    recode u2/test16 -q </dev/null |
    tr -dc "[:$class:]" |
    od -A n -t a -t o1z -w12
done
NOTE:
I use od as the final filter above by preference, and because I know I won't be working with multi-byte characters, which od will not correctly handle. recode u2..dump will both generate output more like that specified in the question and handle wide characters correctly.
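And if recode isn't available at all, a rough substitute for single-byte locales is to generate the bytes yourself and let tr do the classifying. This is my own sketch, not part of the solution above, using [:digit:] as the example class:

```shell
# Sketch (single-byte/ASCII locales only): emit bytes 1..127 with
# printf, then let tr keep only the members of the chosen class
printf "$(printf '\\%03o' $(seq 1 127))" | tr -dc '[:digit:]'
# -> 0123456789
```

Swap in any other class name for [:digit:]; it falls apart for multi-byte charsets for the same reason od does.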
OUTPUT
upper
A B C D E F G H I J K L
101 102 103 104 105 106 107 110 111 112 113 114 >ABCDEFGHIJKL<
M N O P Q R S T U V W X
115 116 117 120 121 122 123 124 125 126 127 130 >MNOPQRSTUVWX<
Y Z
131 132 >YZ<
lower
a b c d e f g h i j k l
141 142 143 144 145 146 147 150 151 152 153 154 >abcdefghijkl<
m n o p q r s t u v w x
155 156 157 160 161 162 163 164 165 166 167 170 >mnopqrstuvwx<
y z
171 172 >yz<
alpha
A B C D E F G H I J K L
101 102 103 104 105 106 107 110 111 112 113 114 >ABCDEFGHIJKL<
M N O P Q R S T U V W X
115 116 117 120 121 122 123 124 125 126 127 130 >MNOPQRSTUVWX<
Y Z a b c d e f g h i j
131 132 141 142 143 144 145 146 147 150 151 152 >YZabcdefghij<
k l m n o p q r s t u v
153 154 155 156 157 160 161 162 163 164 165 166 >klmnopqrstuv<
w x y z
167 170 171 172 >wxyz<
digit
0 1 2 3 4 5 6 7 8 9
060 061 062 063 064 065 066 067 070 071 >0123456789<
xdigit
0 1 2 3 4 5 6 7 8 9 A B
060 061 062 063 064 065 066 067 070 071 101 102 >0123456789AB<
C D E F a b c d e f
103 104 105 106 141 142 143 144 145 146 >CDEFabcdef<
space
ht nl vt ff cr sp
011 012 013 014 015 040 >..... <
print
sp ! " # $ % & ' ( ) * +
040 041 042 043 044 045 046 047 050 051 052 053 > !"#$%&'()*+<
, - . / 0 1 2 3 4 5 6 7
054 055 056 057 060 061 062 063 064 065 066 067 >,-./01234567<
8 9 : ; < = > ? @ A B C
070 071 072 073 074 075 076 077 100 101 102 103 >89:;<=>?@ABC<
D E F G H I J K L M N O
104 105 106 107 110 111 112 113 114 115 116 117 >DEFGHIJKLMNO<
P Q R S T U V W X Y Z [
120 121 122 123 124 125 126 127 130 131 132 133 >PQRSTUVWXYZ[<
\ ] ^ _ ` a b c d e f g
134 135 136 137 140 141 142 143 144 145 146 147 >\]^_`abcdefg<
h i j k l m n o p q r s
150 151 152 153 154 155 156 157 160 161 162 163 >hijklmnopqrs<
t u v w x y z { | } ~
164 165 166 167 170 171 172 173 174 175 176 >tuvwxyz{|}~<
graph
! " # $ % & ' ( ) * + ,
041 042 043 044 045 046 047 050 051 052 053 054 >!"#$%&'()*+,<
- . / 0 1 2 3 4 5 6 7 8
055 056 057 060 061 062 063 064 065 066 067 070 >-./012345678<
9 : ; < = > ? @ A B C D
071 072 073 074 075 076 077 100 101 102 103 104 >9:;<=>?@ABCD<
E F G H I J K L M N O P
105 106 107 110 111 112 113 114 115 116 117 120 >EFGHIJKLMNOP<
Q R S T U V W X Y Z [ \
121 122 123 124 125 126 127 130 131 132 133 134 >QRSTUVWXYZ[\<
] ^ _ ` a b c d e f g h
135 136 137 140 141 142 143 144 145 146 147 150 >]^_`abcdefgh<
i j k l m n o p q r s t
151 152 153 154 155 156 157 160 161 162 163 164 >ijklmnopqrst<
u v w x y z { | } ~
165 166 167 170 171 172 173 174 175 176 >uvwxyz{|}~<
blank
ht sp
011 040 >. <
cntrl
nul soh stx etx eot enq ack bel bs ht nl vt
000 001 002 003 004 005 006 007 010 011 012 013 >............<
ff cr so si dle dc1 dc2 dc3 dc4 nak syn etb
014 015 016 017 020 021 022 023 024 025 026 027 >............<
can em sub esc fs gs rs us del
030 031 032 033 034 035 036 037 177 >.........<
punct
! " # $ % & ' ( ) * + ,
041 042 043 044 045 046 047 050 051 052 053 054 >!"#$%&'()*+,<
- . / : ; < = > ? @ [ \
055 056 057 072 073 074 075 076 077 100 133 134 >-./:;<=>?@[\<
] ^ _ ` { | } ~
135 136 137 140 173 174 175 176 >]^_`{|}~<
alnum
0 1 2 3 4 5 6 7 8 9 A B
060 061 062 063 064 065 066 067 070 071 101 102 >0123456789AB<
C D E F G H I J K L M N
103 104 105 106 107 110 111 112 113 114 115 116 >CDEFGHIJKLMN<
O P Q R S T U V W X Y Z
117 120 121 122 123 124 125 126 127 130 131 132 >OPQRSTUVWXYZ<
a b c d e f g h i j k l
141 142 143 144 145 146 147 150 151 152 153 154 >abcdefghijkl<
m n o p q r s t u v w x
155 156 157 160 161 162 163 164 165 166 167 170 >mnopqrstuvwx<
y z
171 172 >yz<
PROGRAMMER'S API
As I demonstrate below, recode will provide you with your complete character map. According to its manual, it does this according first to the current value of the DEFAULT_CHARSET environment variable, or, failing that, it operates exactly as you specify:

When a charset name is omitted or left empty, the value of the DEFAULT_CHARSET variable in the environment is used instead. If this variable is not defined, the recode library uses the current locale's encoding. On POSIX compliant systems, this depends on the first non-empty value among the environment variables LC_ALL, LC_CTYPE, LANG, and can be determined through the command locale charmap.
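That lookup chain is easy enough to verify from the shell; locale charmap reports exactly that encoding:

```shell
# per the quote: locale charmap reports the current locale's encoding;
# forcing the C locale pins it to ASCII (named ANSI_X3.4-1968 on glibc)
LC_ALL=C locale charmap
```

Left unforced, it would instead follow whatever LC_ALL, LC_CTYPE, or LANG resolve to in your environment.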
Also worth noting about recode is that it is an API:

The program named recode is just an application of its recoding library. The recoding library is available separately for other C programs. A good way to acquire some familiarity with the recoding library is to get acquainted with the recode program itself.

To use the recoding library once it is installed, a C program needs to have a line:
#include <recode.h>
For internationally-friendly string comparison, the POSIX and C standards define the strcoll() function:

The strcoll() function shall compare the string pointed to by s1 to the string pointed to by s2, both interpreted as appropriate to the LC_COLLATE category of the current locale.

The strcoll() function shall not change the setting of errno if successful. Since no return value is reserved to indicate an error, an application wishing to check for error situations should set errno to 0, then call strcoll(), then check errno.
Here is a separately located example of its usage:

#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(void)
{
    char str1[15];
    char str2[15];
    int ret;

    /* apply the current locale's LC_COLLATE rules */
    setlocale(LC_ALL, "");

    strcpy(str1, "abc");
    strcpy(str2, "ABC");

    ret = strcoll(str1, str2);

    if (ret > 0)
        printf("str1 is greater than str2");
    else if (ret < 0)
        printf("str1 is less than str2");
    else
        printf("str1 is equal to str2");

    return 0;
}
Regarding the POSIX character classes, you've already noted you used the C API to find these. For Unicode characters and classes you can use recode's dump-with-names charset to get the desired output. From its manual again:

For example, the command recode l2..full < input implies a necessary conversion from Latin-2 to UCS-2, as dump-with-names is only connected out from UCS-2. In such cases, recode does not display the original Latin-2 codes in the dump, only the corresponding UCS-2 values. To give a simpler example, the command
echo 'Hello, world!' | recode us..dump
produces the following output:
UCS2 Mne Description
0048 H latin capital letter h
0065 e latin small letter e
006C l latin small letter l
006C l latin small letter l
006F o latin small letter o
002C , comma
0020 SP space
0077 w latin small letter w
006F o latin small letter o
0072 r latin small letter r
006C l latin small letter l
0064 d latin small letter d
0021 ! exclamation mark
000A LF line feed (lf)
The descriptive comment is given in English and ASCII, yet if the English description is not available but a French one is, then the French description is given instead, using Latin-1. However, if the LANGUAGE or LANG environment variable begins with the letters fr, then listing preference goes to French when both descriptions are available.
Using similar syntax to the above, combined with its included test dataset, I can get my own character map with:
recode -q u8/test8..dump </dev/null
OUTPUT
UCS2 Mne Description
0001 SH start of heading (soh)
0002 SX start of text (stx)
0003 EX end of text (etx)
...
002B + plus sign
002C , comma
002D - hyphen-minus
...
0043 C latin capital letter c
0044 D latin capital letter d
0045 E latin capital letter e
...
006B k latin small letter k
006C l latin small letter l
006D m latin small letter m
...
007B (! left curly bracket
007C !! vertical line
007D !) right curly bracket
007E '? tilde
007F DT delete (del)
But for common characters, recode is apparently not necessary. This should give you named chars for everything in the 128-byte ASCII charset:
printf %b "$(printf \\%04o $(seq 128))" |
luit -c |
od -A n -t o1z -t a -w12
OUTPUT
001 002 003 004 005 006 007 010 011 012 013 014 >............<
soh stx etx eot enq ack bel bs ht nl vt ff
...
171 172 173 174 175 176 177 >yz{|}~.<
y z { | } ~ del
Of course, only 128 bytes are represented, but that's because my locale, utf-8 charmaps or not, uses the ASCII charset and nothing more. So that's all I get. If I ran it without luit filtering it, though, od would roll it back around and print the same map again up to \0400.
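That roll-around is easy to see in miniature: GNU od's -t a output names characters while ignoring the high-order bit, so a byte above \0200 gets the same name as its 7-bit twin.

```shell
# 0101 is "A"; 0301 is 0101 with the top bit set, and GNU od's
# -t a names both bytes identically
printf '\101\301' | od -A n -t a
```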
There are two major problems with the above method. First, there is the system's collation order: for non-ASCII locales the byte values for the charsets are not simply in sequence, which, I think, is likely the core of the problem you're trying to solve.
Well, GNU tr's man page states that it will expand the [:upper:] and [:lower:] classes in order, but that's not a lot.
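That in-order expansion is the whole trick behind the classic case-conversion one-liner, for what it's worth:

```shell
# [:lower:] and [:upper:] expand in corresponding order, so tr
# maps each lowercase letter onto its uppercase partner
echo 'Hello, World!' | tr '[:lower:]' '[:upper:]'
# -> HELLO, WORLD!
```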
I imagine some heavy-handed solution could be implemented with sort, but that would be a rather unwieldy tool for a backend programming API.
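Unwieldy or not, sort does at least make the collation problem visible; in the C locale it falls back to raw byte order, so:

```shell
# byte-order collation puts every uppercase letter before
# any lowercase one
printf '%s\n' b A a B | LC_ALL=C sort | tr '\n' ' '
# -> "A B a b "
```

A locale with dictionary collation would interleave the cases instead.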
recode will do this correctly, but you didn't seem too in love with the program the other day. Maybe today's edits will cast a more friendly light on it, or maybe not.
GNU also offers the gettext function library, and it seems able to address this problem, at least for the LC_MESSAGES context:
— Function: char * bind_textdomain_codeset (const char *domainname, const char *codeset)

The bind_textdomain_codeset function can be used to specify the output character set for message catalogs for domain domainname. The codeset argument must be a valid codeset name which can be used for the iconv_open function, or a null pointer.

If the codeset parameter is the null pointer, bind_textdomain_codeset returns the currently selected codeset for the domain with the name domainname. It returns NULL if no codeset has yet been selected.

The bind_textdomain_codeset function can be used several times. If used multiple times with the same domainname argument, the later call overrides the settings made by the earlier one.

The bind_textdomain_codeset function returns a pointer to a string containing the name of the selected codeset. The string is allocated internally in the function and must not be changed by the user. If the system went out of core during the execution of bind_textdomain_codeset, the return value is NULL and the global variable errno is set accordingly.
You might also use native Unicode character categories, which are language independent and forego the POSIX classes altogether, or perhaps call on the former to provide you enough information to define the latter.
In addition to complications, Unicode also brings new possibilities. One is that each Unicode character belongs to a certain category. You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}.

Again, "character" really means "Unicode code point". \p{L} matches a single code point in the category "letter". If your input string is à encoded as U+0061 U+0300, it matches a without the accent. If the input is à encoded as U+00E0, it matches à with the accent. The reason is that both the code points U+0061 (a) and U+00E0 (à) are in the category "letter", while U+0300 is in the category "mark".

You should now understand why \P{M}\p{M}*+ is the equivalent of \X. \P{M} matches a code point that is not a combining mark, while \p{M}*+ matches zero or more code points that are combining marks. To match a letter including any diacritics, use \p{L}\p{M}*+. This last regex will always match à, regardless of how it is encoded. The possessive quantifier makes sure that backtracking doesn't cause \P{M}\p{M}*+ to match a non-mark without the combining marks that follow it, which \X would never do.
The same website that provided the above information also discusses Tcl's own POSIX-compliant regex implementation, which might be yet another way to achieve your goal.
And last among the solutions, I will suggest that you can interrogate the LC_COLLATE file itself for the complete, in-order system character map. This may not seem easily done, but I achieved some success with the following, after compiling it with localedef as demonstrated below:
<LC_COLLATE od -j2K -a -w2048 -v |
tail -n2 |
cut -d' ' -f$(seq -s',' 4 2 2048) |
sed 's/nul\|\\0//g;s/ */ /g;:s;
s/\([^ ]\{1,3\}\) \1/\1/;ts;
s/\(\([^ ][^ ]* *\)\{16\}\)/\1\n/g'
dc1 dc2 dc3 dc4 nak syn etb can c fs c rs c sp ! "
# $ % & ' ( ) * + , - . / 0 1 2
3 4 5 6 7 8 9 : ; < = > ? @ A B
C D E F G H I J K L M N O P Q R
S T U V W X Y Z [ \ ] ^ _ ` a b
c d e f g h i j k l m n o p q r
s t u v w x y z { | } ~ del soh stx etx
eot enq ack bel c ht c vt cr c si dle dc1 del
It is, admittedly, currently flawed but I hope it demonstrates the possibility at least.
AT FIRST BLUSH
strings $_/en_GB
#OUTPUT
int_select "<U0030><U0030>"
...
END LC_TELEPHONE
It really didn't look like much, but then I started noticing copy commands throughout the list. The above file seems to copy in "en_US", for instance, and another real big one that it seems they all share to some degree is iso_14651_t1_common.
It's pretty big:
strings $_ | wc -c
#OUTPUT
431545
Here is the intro to /usr/share/i18n/locales/POSIX:
# Territory:
# Revision: 1.1
# Date: 1997-03-15
# Application: general
# Users: general
# Repertoiremap: POSIX
# Charset: ISO646:1993
# Distribution and use is free, also for
# commercial purposes.
LC_CTYPE
# The following is the POSIX Locale LC_CTYPE.
# "alpha" is by default "upper" and "lower"
# "alnum" is by definition "alpha" and "digit"
# "print" is by default "alnum", "punct" and the <U0020> character
# "graph" is by default "alnum" and "punct"
upper <U0041>;<U0042>;<U0043>;<U0044>;<U0045>;<U0046>;<U0047>;<U0048>;\
<U0049>;<U004A>;<U004B>;<U004C>;<U004D>;<U004E>;<U004F>;
...
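The composition rules in those comments are easy to spot-check with tr; here is a quick sketch of my own on an ASCII sample string:

```shell
# "alnum" is "alpha" and "digit", so filtering by either form
# of the class should agree
a=$(printf 'aZ9 _!' | tr -dc '[:alnum:]')
b=$(printf 'aZ9 _!' | tr -dc '[:alpha:][:digit:]')
[ "$a" = "$b" ] && printf '%s\n' "$a"
# -> aZ9
```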
You can grep through this, of course, but you might just run:
recode -lf gb
instead. You'd get something like this:
Dec Oct Hex UCS2 Mne BS_4730
0 000 00 0000 NU null (nul)
1 001 01 0001 SH start of heading (soh)
...
... AND MORE
There is also the luit terminal UTF-8 pty translation device, which, I guess, acts as a go-between for XTerms without UTF-8 support. It handles a lot of switches, such as logging all converted bytes to a file, or -c for use as a simple pipe filter.
I never realized there was so much to this - the locales and character maps and all of that. This is apparently a very big deal, but I guess it all goes on behind the scenes. There are - at least on my system - a couple hundred man 3 results for locale-related searches.
And also there is:
zcat /usr/share/i18n/charmaps/UTF-8*gz | less
CHARMAP
<U0000> /x00 NULL
<U0001> /x01 START OF HEADING
<U0002> /x02 START OF TEXT
<U0003> /x03 END OF TEXT
<U0004> /x04 END OF TRANSMISSION
<U0005> /x05 ENQUIRY
...
That will go on for a very long while.
The Xlib functions handle this all of the time - luit is a part of that package.
The Tcl_uni... functions might prove useful as well.
Just a little <tab> completion and man searching, and I've learned quite a lot on this subject.
With localedef you can compile the locales in your I18N directory. The output is funky, and not extraordinarily useful - not like the charmaps at all - but you can get the raw format, just as you specify above, like I did:
mkdir -p dir && cd $_ ; localedef -f UTF-8 -i en_GB ./
ls -l
total 1508
drwxr-xr-x 1 mikeserv mikeserv 30 May 6 18:35 LC_MESSAGES
-rw-r--r-- 1 mikeserv mikeserv 146 May 6 18:35 LC_ADDRESS
-rw-r--r-- 1 mikeserv mikeserv 1243766 May 6 18:35 LC_COLLATE
-rw-r--r-- 1 mikeserv mikeserv 256420 May 6 18:35 LC_CTYPE
-rw-r--r-- 1 mikeserv mikeserv 376 May 6 18:35 LC_IDENTIFICATION
-rw-r--r-- 1 mikeserv mikeserv 23 May 6 18:35 LC_MEASUREMENT
-rw-r--r-- 1 mikeserv mikeserv 290 May 6 18:35 LC_MONETARY
-rw-r--r-- 1 mikeserv mikeserv 77 May 6 18:35 LC_NAME
-rw-r--r-- 1 mikeserv mikeserv 54 May 6 18:35 LC_NUMERIC
-rw-r--r-- 1 mikeserv mikeserv 34 May 6 18:35 LC_PAPER
-rw-r--r-- 1 mikeserv mikeserv 56 May 6 18:35 LC_TELEPHONE
-rw-r--r-- 1 mikeserv mikeserv 2470 May 6 18:35 LC_TIME
Then with od you can read it - bytes and strings:
od -An -a -t u1z -w12 LC_COLLATE | less
etb dle enq sp dc3 nul nul nul T nul nul nul
23 16 5 32 19 0 0 0 84 0 0 0 >... ....T...<
...
Though it is a long way off from winning a beauty contest, that is usable output. And od is as configurable as you want it to be, as well, of course.
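On that configurability: multiple -t flags give simultaneous views of the same read, which is how the dumps above pair names with octal. A trivial sketch:

```shell
# one pass over two bytes, three interpretations: unsigned
# decimal bytes, octal bytes, and C-style printable characters
printf 'AB' | od -A n -t u1 -t o1 -t c
```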
I guess I also forgot about these:
perl -mLocale
-- Perl module --
Locale::Codes Locale::Codes::LangFam Locale::Codes::Script_Retired
Locale::Codes::Constants Locale::Codes::LangFam_Codes Locale::Country
Locale::Codes::Country Locale::Codes::LangFam_Retired Locale::Currency
Locale::Codes::Country_Codes Locale::Codes::LangVar Locale::Language
Locale::Codes::Country_Retired Locale::Codes::LangVar_Codes Locale::Maketext
Locale::Codes::Currency Locale::Codes::LangVar_Retired Locale::Maketext::Guts
Locale::Codes::Currency_Codes Locale::Codes::Language Locale::Maketext::GutsLoader
Locale::Codes::Currency_Retired Locale::Codes::Language_Codes Locale::Maketext::Simple
Locale::Codes::LangExt Locale::Codes::Language_Retired Locale::Script
Locale::Codes::LangExt_Codes Locale::Codes::Script Locale::gettext
Locale::Codes::LangExt_Retired Locale::Codes::Script_Codes locale
I probably forgot about them because I couldn't get them to work. I never use Perl, and I don't know how to load a module properly, I guess. But the man pages look pretty nice. In any case, something tells me you'll find calling a Perl module at least a little less difficult than I did. And, again, these were already on my computer - and I never even use Perl. There are also notably a few I18N modules that I wistfully scrolled by, knowing full well I wouldn't get them to work either.