What is the difference between strcmp() and strcoll()?
For some reason in all unicode locales I tested, on several different versions of glibc, strcoll() returns zero for any two hiraganas. This breaks sort, uniq, and everything that interacts with orders of strings in some way.
$ echo -e -n 'い\nろ\nは\nに\nほ\nへ\nと\n' | sort | uniq
い
which is simply broken beyond repair. People from different places of world might have different ideas on whether 'い' should be placed before or after 'ろ', but nobody sane would consider them the same.
And no, setting your locale to the Japanese one does not matter:
$ LC_ALL=ja_JP.utf8 LANG=ja_JP.utf8 LC_COLLATE=ja_JP.utf8 echo -e -n 'い\nろ\nは\nに\nほ\nへ\nと\n' | sort | uniq
い
There was discussion in some official mailing list, but guess what, it was in 2002 and it was never fixed because people don't care: https://www.mail-archive.com/[email protected]/msg02658.html
That bug happened to us in some day and in the end our only way out was to set the collate locale to "C" and rely on the nice properties of utf-8 encoding. That's a horrible experience, since one shouldn't really work under "C" locale when processing all-Japanese data.
So for your sanity's sake, do NOT directly use strcoll. A safer variant might be:
int safe_strcoll(const char *a, const char *b)
{
int ret = strcoll(a, b);
if (ret != 0) return ret;
return strcmp(a, b);
}
just in case strcoll() decides to screw you...
EDIT: I just repeated the experiment out of curiosity, and my current system (with glibc 2.29) works without problems now. Locale doesn't matter either.
strcmp()
takes the bytes of the string one by one and compare them as is whatever the bytes are.
strcoll()
takes the bytes, transform them using the locale, then compares the result. The transformation re-orders depending on the language. In French, accentuated letters come after the non-accentuated ones. So é is after e. However, é is before f. strcoll()
gets it right. strcmp()
not so well.
However, in many cases strcmp()
is enough because you don't need to show the result ordered in the language (locale) in use. For example, if you just need to quickly access a large number of data indexed by a string you'd use a map indexed by that string. It probably is totally useless to sort those using strcoll()
which is generally very slow (in comparison to strcmp()
at least.)
For details about characters you may also want to check out the Unicode website.
In regard to the locale, it's the language. By default it is set to "C" (more or less, no locale). Once you select a location the locale is set accordingly. You can also set the LC_LOCALE environment variable. There are actually many such variables. But in general you use predefined functions that automatically take those variables in account and do the right thing for you. (i.e. format dates / time, format numbers / measures, compute upper / lower case, etc.)