What does [[.ch.]] mean in a regex?
Collation elements are usually referenced in the context of sorting.
In many languages, collation (sorting as in a dictionary) is not done purely character by character. For instance, in Czech, ch doesn't sort between cg and ci as it would in English, but is considered as a whole for sorting. It is a collating element (we can't call it a character here; characters are a subset of collating elements) that sorts between h and i.
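The difference in ordering can be observed with sort. The following is only a sketch: it assumes a system where `locale -a` lists the installed locales and where the Czech locale is named cs_CZ.UTF-8, and the helper names `have_locale` and `sorted_in` are made up for this illustration.

```shell
# Hypothetical helper: succeed if the named locale appears in `locale -a`.
# Spelling differences such as UTF-8 vs utf8 are ignored.
have_locale() {
  want=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d '-')
  locale -a 2>/dev/null | tr '[:upper:]' '[:lower:]' | tr -d '-' | grep -qxF -e "$want"
}

# Hypothetical helper: sort a fixed word list under the given locale and
# print the result on one line.
sorted_in() {
  printf '%s\n' i ch h ci cg | LC_ALL="$1" sort | paste -sd ' ' -
}

echo "C locale:     $(sorted_in C)"
if have_locale cs_CZ.UTF-8; then
  echo "Czech locale: $(sorted_in cs_CZ.UTF-8)"
else
  echo "cs_CZ.UTF-8 is not installed here, skipping the Czech run" >&2
fi
```

In the C locale the words come out in plain byte order (cg ch ci h i). If the Czech locale is installed and behaves as described above, ch should instead be listed after h and before i.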
Now you may ask: "What does that have to do with regular expressions?" and "Why would I want to refer to a collating element in a bracket expression?"
Well, inside bracket expressions, one does use order. For instance, in [c-j], you want the characters in between c and j. Well, do you? You'd rather want collating elements there. [h-i] in a Czech locale matches ch:
$ echo cho | LC_ALL=cs_CZ.UTF-8 grep '^[h-i]o'
cho
So, if you're able to list a range of collating elements in a bracket expression, then you'd expect to be able to list them individually as well. [a-cch] would match the collating elements in between a and c, plus the individual c and h characters. To have a-c and the ch collating element, we need a new syntax:
$ echo cho | LC_ALL=cs_CZ.UTF-8 grep '^[a-c[.ch.]]o'
cho
(the ones in between a and c, and the ch one).
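To see why simply writing ch inside the brackets is not enough, here is a small sketch run in the C locale, where every bracket item is a single character. The helper name `matches` is made up for this illustration; only standard grep options (-q, -e) are used.

```shell
# Hypothetical helper: print "yes" if grep finds PATTERN ($1) in STRING ($2),
# "no" otherwise. LC_ALL=C is forced so that each bracket item is one byte.
matches() {
  if printf '%s\n' "$2" | LC_ALL=C grep -q -e "$1"; then
    echo yes
  else
    echo no
  fi
}

matches '^[a-cch]o' co    # yes: c is in the set
matches '^[a-cch]o' ho    # yes: h is in the set
matches '^[a-cch]o' cho   # no: c matches, but the next character is h, not o
```

The c and h inside [a-cch] are two separate one-character items, so the bracket expression can never consume ch as a unit. That is the gap the [.ch.] notation fills.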
Now, the world is not perfect yet and probably never will be. The example above was run on a GNU system and worked. Another example of a collating element could be e with a combining acute accent in UTF-8 ($'e\u0301', rendered like $'\u00e9', as é).
é (U+00E9) and é (e followed by U+0301) are the same character to the reader, except that one is represented with a single code point and the other with two.
$ echo $'e\u301t\ue9' | grep '^[d-f]t'
This will work properly on some systems but not others (not on GNU ones, for instance). It is also unclear whether $'[[.\ue9.]]' should match only $'\ue9' or both $'\ue9' and $'e\u301'.
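To make the two forms of é discussed above concrete, this sketch dumps their UTF-8 bytes. Octal escapes are used instead of \u so the result does not depend on the current locale or on shell-specific $'...' support. The helper name `hex` is made up for this illustration.

```shell
# The same visible letter, encoded two ways (bytes written as octal escapes).
precomposed=$(printf '\303\251')    # U+00E9, one code point
decomposed=$(printf 'e\314\201')    # U+0065 followed by U+0301, two code points

# Hypothetical helper: print the bytes of its argument as contiguous hex.
hex() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

echo "precomposed: $(hex "$precomposed")"   # c3a9
echo "decomposed:  $(hex "$decomposed")"    # 65cc81
```

A regex engine that compares bytes or code points without any normalization sees two unrelated strings here, which is why the matching behaviour differs between systems.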
Not to mention non-alphabetic scripts, scripts with different regional sorting orders, or things like ﬃ (ffi in one character), all of which become tricky to handle with such a simple API.