Which character encodings are supported by POSIX?
There is no specific character encoding mandated by POSIX. The only character in a fixed position is null, which must be 00.
What POSIX does require is that all characters from its Portable Character Set exist. The Portable Character Set contains the printable ASCII characters, space, BEL, backspace, tab, carriage return, newline, vertical tab, form feed, and null. Where or how those are encoded is not specified, except that:
- They are all a single byte (8 bits).
- Null is represented with all bits zero.
- The digits 0-9 appear contiguously in that order.
It imposes no other restrictions on the representation of characters, so a conforming system is free to support encodings with any representation of those characters, and any other characters in addition.
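As a rough illustration, the three constraints above can be checked against the codec tables that ship with Python. This is only a sketch: satisfies_pcs_rules is a hypothetical helper, the codec names are Python's own, and passing the check says nothing about whether a particular system actually provides a locale using that encoding.

```python
import string

# Portable Character Set as described above: printable ASCII, space, and the
# control characters NUL, BEL, BS, HT, LF, VT, FF and CR.
PCS = string.ascii_letters + string.digits + string.punctuation + " \0\a\b\t\n\v\f\r"


def satisfies_pcs_rules(encoding):
    """Return True if a Python codec meets the three constraints listed above."""
    try:
        encoded = {ch: ch.encode(encoding) for ch in PCS}
    except (UnicodeEncodeError, LookupError):
        # A PCS character cannot be represented, or the codec name is unknown.
        return False

    # Every PCS character must fit in a single byte.
    single_byte = all(len(b) == 1 for b in encoded.values())
    # Null must be represented with all bits zero.
    nul_is_zero = encoded["\0"] == b"\x00"
    # The digits 0-9 must be contiguous and in order.
    digit_codes = [encoded[d][0] for d in string.digits]
    first = digit_codes[0]
    digits_contiguous = digit_codes == list(range(first, first + 10))

    return single_byte and nul_is_zero and digits_contiguous


for name in ("ascii", "utf-8", "cp037", "utf-16-le"):
    print(name, satisfies_pcs_rules(name))
```

ASCII, UTF-8 and the EBCDIC code page cp037 pass this check, even though cp037 places the letters and digits at entirely different byte values. UTF-16 fails because its PCS characters occupy two bytes each.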
Different locales on the same system can have different representations of those characters, with the exception of the period (.) and slash (/), and if an application uses any pair of locales where the character encodings differ, or accesses data from an application using a locale which has different encodings from the locales used by the application, the results are unspecified.
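POSIX leaves the outcome of mixing encodings unspecified, so no example can show "the" result. The sketch below only illustrates one common failure, using Python codecs as stand-ins for two locales: cp037 (an EBCDIC code page) for the producer and latin-1 (an ASCII-based encoding) for the consumer. Both choices are assumptions made for illustration.

```python
text = "report-2024\n"

# Producer writes the data using an EBCDIC code page.
written = text.encode("cp037")

# Consumer wrongly assumes an ASCII-based encoding when reading it back.
misread = written.decode("latin-1")

# Reading with the same encoding that was used for writing round-trips.
reread = written.decode("cp037")

print(misread == text)
print(reread == text)
```

The bytes themselves are intact in both cases. Only the interpretation differs, which is why the producer and consumer need to agree on the encoding.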
The only files that all POSIX-compliant systems are required to treat in the same way are files consisting entirely of null bytes. Files treated as text have their lines terminated by the encoding's representation of the PCS's newline character.
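To make the point about line terminators concrete, here is a minimal sketch that splits raw bytes on whatever byte the chosen encoding uses for newline. split_lines is a hypothetical helper, and it relies on the single-byte guarantee for Portable Character Set characters, so it is not suitable for encodings such as UTF-16.

```python
def split_lines(data, encoding):
    """Split bytes into lines using the encoding's own newline byte."""
    newline = "\n".encode(encoding)
    lines = data.split(newline)
    # A terminated final line leaves an empty trailing element; drop it.
    if lines and lines[-1] == b"":
        lines.pop()
    return lines


ascii_data = "one\ntwo\n".encode("ascii")
ebcdic_data = "one\ntwo\n".encode("cp037")

print("\n".encode("ascii"), "\n".encode("cp037"))
print(split_lines(ascii_data, "ascii"))
print([line.decode("cp037") for line in split_lines(ebcdic_data, "cp037")])
```

In Python's cp037 table the newline character is byte 0x25 rather than 0x0A, so a reader that looks for 0x0A would not find any line breaks in the EBCDIC data.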
The POSIX standard introduces a POSIX locale, which has the same order as the ASCII character set for characters in ASCII (POSIX Base Definitions §7.3.2).
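A small sketch of what that ordering means in practice, assuming Python's locale module and ASCII-only input: collating in the POSIX locale (also named "C") should give the same result as sorting by character code.

```python
import locale

# Select the POSIX ("C") locale for collation only.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["b", "a", "B", "_", "1"]

# Sort using the locale's collation rules.
by_locale = sorted(words, key=locale.strxfrm)

# Sort by raw character codes for comparison.
by_code = sorted(words, key=lambda w: [ord(c) for c in w])

print(by_locale)
print(by_locale == by_code)
```

Digits sort before uppercase letters, which sort before the underscore and the lowercase letters, exactly as in the ASCII table. Other locales are free to interleave upper and lower case or to ignore punctuation.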
Besides that, on systems where the symbolic constant _POSIX2_LOCALEDEF is defined (which shall be defined for XSI-conformant systems, and can be tested via getconf POSIX2_LOCALEDEF), the system supports the creation of new locales, using the localedef utility, and locale definitions as specified in POSIX Base Definitions §7.3.
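The same query can be made programmatically. The sketch below assumes Python's os.sysconf and the name SC_2_LOCALEDEF from os.sysconf_names, which may be missing on some platforms; localedef_supported is a hypothetical helper that returns None when the answer cannot be determined.

```python
import os


def localedef_supported():
    """Return True/False for localedef support, or None if it cannot be queried."""
    name = "SC_2_LOCALEDEF"
    # os.sysconf_names does not exist on non-POSIX platforms.
    if name not in getattr(os, "sysconf_names", {}):
        return None
    try:
        # A positive value indicates the option is supported; -1 means it is not.
        return os.sysconf(name) > 0
    except (OSError, ValueError):
        return None


print(localedef_supported())
```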
POSIX locale definitions don't support specifying characters by their Unicode value. For that, there are newer standards, such as ISO/IEC TR 14652 (available at the ISO/IEC JTC1/SC22/WG20 home) and ISO TR 30112 (draft available at the ISO/IEC JTC1/SC35/WG5 home), which obsoletes ISO/IEC TR 14652.
Other related standards are ISO 14651 (available at the ISO ITTF site) and the Unicode Collation Algorithm (UCA, Unicode UTS#10).
The Unicode::Tussle Perl module at CPAN includes Unicode rewrites of several Unix utilities. sed and awk scripts and one-liners can (relatively easily) be rewritten in Perl for Unicode support.
For glibc, bugzilla entries for component localedata can provide a view of the status of different locales.