What does [[.ch.]] mean in a regex?

Collation elements are usually referenced in the context of sorting.

In many languages, collation (sorting as in a dictionary) is not done purely character by character. For instance, in Czech, ch doesn't sort between cg and ci as it would in English, but is treated as a single unit for sorting. It is a collating element (we can't call it a character here; characters are a subset of collating elements) that sorts between h and i.
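A quick way to see the difference is to sort the same words under two locales. This is a minimal sketch, assuming a POSIX shell with sort and locale; the words and c_sorted variable names are just illustrative, and the Czech part only runs if a cs_CZ.UTF-8 locale happens to be installed (its output depends on the system's locale data, so none is shown here).

```shell
# Illustrative word list (made up for this example).
words='ci
ch
cg
h
i'

# Byte order (C locale): "ch" is just c followed by h,
# so it lands between cg and ci.
c_sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)
printf 'C locale order:\n%s\n' "$c_sorted"

# Czech order, only attempted when the locale is installed on this system.
if locale -a 2>/dev/null | grep -Eqix 'cs_CZ\.utf-?8'; then
  printf 'cs_CZ.UTF-8 order:\n'
  printf '%s\n' "$words" | LC_ALL=cs_CZ.UTF-8 sort
else
  printf 'cs_CZ.UTF-8 locale not installed; skipping\n'
fi
```

In the C locale the result is cg, ch, ci, h, i. With Czech locale data that treats ch as a collating element, one would expect ch to move to between h and i instead.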

Now you may ask: "What does that have to do with regular expressions?" and "Why would I want to refer to a collating element in a bracket expression?"

Well, inside bracket expressions, order does matter. For instance, with [c-j], you want the characters between c and j. Or do you? What you really want there is collating elements. [h-i] in a Czech locale matches ch:

$ echo cho | LC_ALL=cs_CZ.UTF-8 grep '^[h-i]o'
cho

So, if you can list a range of collating elements in a bracket expression, you'd expect to be able to list them individually as well. But [a-cch] would match the collating elements between a and c, plus the c and h characters. To get a-c and the ch collating element, we need a new syntax:

$ echo cho | LC_ALL=cs_CZ.UTF-8 grep '^[a-c[.ch.]]o'
cho

(that is, the collating elements between a and c, plus the ch one).
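The difference between the two bracket expressions can be checked even without a Czech locale, since in the C locale [a-cch] is simply the single characters a, b, c and h. This is a minimal sketch, assuming a POSIX shell and grep; the matches helper is a name made up for this example, and the [[.ch.]] test is only attempted when cs_CZ.UTF-8 is installed, with no claim about what a given grep implementation will report.

```shell
# Hypothetical helper: succeed if string $3 matches the BRE $2 under locale $1.
matches() { printf '%s\n' "$3" | LC_ALL=$1 grep -q -- "$2"; }

# In the C locale, [a-cch] is a set of single characters: a, b, c, h.
# "ho" matches because h alone is in the set.
if matches C '^[a-cch]o' ho; then single_h=yes; else single_h=no; fi
# "cho" does not: c matches the set, but then o is required and h is found.
if matches C '^[a-cch]o' cho; then whole_ch=yes; else whole_ch=no; fi
printf 'ho: %s, cho: %s\n' "$single_h" "$whole_ch"

# The collating-element syntax, only tried when the Czech locale is installed.
if locale -a 2>/dev/null | grep -Eqix 'cs_CZ\.utf-?8'; then
  if matches cs_CZ.UTF-8 '^[a-c[.ch.]]o' cho; then
    echo '[[.ch.]] matched cho'
  else
    echo '[[.ch.]] did not match cho (or is unsupported) on this system'
  fi
fi
```

Passing the locale as an argument to the helper, rather than prefixing the function call with LC_ALL=..., sidesteps differences in how shells handle assignments in front of function calls.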

Now, the world is not perfect yet and probably never will be. The example above was run on a GNU system, where it worked. Another example of a collating element could be e followed by a combining acute accent in UTF-8 ($'e\u0301', rendered like $'\u00e9', as é).

$'\u00e9' and $'e\u0301' render as the same character, é, except that one is encoded as a single code point and the other as two.

$ echo $'e\u301t\ue9' | grep '^[d-f]t'

This works properly on some systems but not on others (not on GNU ones, for instance). And it's unclear whether $'[[.\ue9.]]' should match only $'\ue9' or both $'\ue9' and $'e\u301'.
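Part of the ambiguity is that the two forms are different byte sequences, even though they render alike. This is a minimal sketch, assuming a POSIX shell with od and tr; the hex helper and the variable names are made up for this example, and octal escapes are used so that it does not depend on $'\u...' support in the shell.

```shell
precomposed=$(printf '\303\251')   # U+00E9 encoded in UTF-8
decomposed=$(printf 'e\314\201')   # U+0065 followed by U+0301, in UTF-8

# Hypothetical helper: print the bytes of $1 as hex, with no separators.
hex() { printf '%s' "$1" | od -An -tx1 | tr -d '[:space:]'; }

printf 'precomposed: %s\n' "$(hex "$precomposed")"   # c3a9
printf 'decomposed:  %s\n' "$(hex "$decomposed")"    # 65cc81
```

A regex engine that compares bytes or code points sees two unrelated strings here; treating them as the same collating element requires locale or normalization support that not every implementation provides.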

Not to mention non-alphabetic scripts, scripts with different regional sorting orders, or things like ﬃ (ffi as one character), all of which become tricky to handle with such a simple API.
