Undocumented Java regex character class: \p{C}

Buried down in the Pattern docs under Unicode Support, we find the following:

This class is in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression, plus RL2.1 Canonical Equivalents.

...

Categories may be specified with the optional prefix Is: Both \p{L} and \p{IsL} denote the category of Unicode letters. Same as scripts and blocks, categories can also be specified by using the keyword general_category (or its short form gc) as in general_category=Lu or gc=Lu.

The supported categories are those of The Unicode Standard in the version specified by the Character class. The category names are those defined in the Standard, both normative and informative.

From Unicode Technical Standard #18, we find that C is defined to match any Other General_Category value, and that support for this is part of the requirements for Level 1 conformance. Java implements \p{C} because it claims conformance to Level 1 of UTS #18.


It probably should support \p{Other}, but apparently it doesn't.

Worse, it's violating RL1.7, required for Level 1 conformance, which requires that matching happen by code point instead of code unit:

To meet this requirement, an implementation shall handle the full range of Unicode code points, including values from U+FFFF to U+10FFFF. In particular, where UTF-16 is used, a sequence consisting of a leading surrogate followed by a trailing surrogate shall be handled as a single code point in matching.

There should be no matches for \p{C} in your test string, because your test string should be matched as a single emoji code point with General_Category=So (Other Symbol) instead of as two surrogates.


According to https://regex101.com/, \p{C} matches

Invisible control characters and unused code points

(the \ has to be escaped because java string, so string \\p{C} is regex \p{C})

I'm guessing this is a 'hacked string check' as a \p{C} probably should never appear inside a valid (character filled) string, but the author should have left a comment as what they checked and what they wanted to check are usually 2 different things.