How to protect against diacritics such as Zalgo text
Is there even a limit?!
Not intrinsically in Unicode. There is the concept of a 'Stream-Safe' format in UAX #15 that sets a limit of 30 combiners per grapheme cluster. Unicode strings in general are not guaranteed to be Stream-Safe, but this can certainly be taken as a sign that the Unicode Consortium doesn't intend to standardise new characters that would require a grapheme cluster longer than that.
30 is still an awful lot. The longest known natural-language grapheme cluster is the Tibetan Hakṣhmalawarayaṁ at 1 base plus 8 combining characters, so for now it would be reasonable to normalise to NFD and disallow any sequence of more than 8 combiners in a row.
If you only care about common Western European languages you can probably bring that down to 2, so a practical cut-off sits somewhere between those two figures.
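Here is a minimal sketch of that check in C# (the helper name `IsWithinCombinerLimit` and the choice of mark categories are my own, not anything standard; it only counts BMP marks, since supplementary-plane combining characters arrive as surrogate pairs and would reset the run):

```csharp
using System.Globalization;
using System.Text;

static class CombinerCheck
{
    // Hypothetical helper: normalise to NFD, then reject any run of more
    // than maxCombiners consecutive combining marks. 8 matches the longest
    // known natural-language cluster; drop to ~2 for Western European text.
    public static bool IsWithinCombinerLimit(string input, int maxCombiners = 8)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        int run = 0;
        foreach (char c in decomposed)
        {
            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
            bool isCombiner = cat == UnicodeCategory.NonSpacingMark
                           || cat == UnicodeCategory.SpacingCombiningMark
                           || cat == UnicodeCategory.EnclosingMark;
            if (isCombiner)
            {
                if (++run > maxCombiners) return false;
            }
            else
            {
                run = 0; // base character (or anything else) resets the run
            }
        }
        return true;
    }
}
```

With the default limit of 8 this accepts ordinary accented text, while calling it as `IsWithinCombinerLimit(input, maxCombiners: 2)` enforces the stricter Western European cut-off.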
I think I found a solution using NormalizationForm.FormC instead of NormalizationForm.FormD. According to the MSDN:
[FormC] Indicates that a Unicode string is normalized using full canonical decomposition, followed by the replacement of sequences with their primary composites, if possible.
I take that to mean that it decomposes characters to their base form, then recomposes them according to a consistent set of rules. I gather this is intended for comparison purposes, but in my case it works perfectly: characters like ü, é, and Ä are decomposed and recomposed accurately, while the bogus sequences fail to recompose and so stay decomposed, leaving the stray combining marks exposed:
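A small demonstration of that behaviour, assuming the standard System.Text and System.Globalization APIs (the sample strings are my own):

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

class FormCDemo
{
    static void Main()
    {
        // "u" + COMBINING DIAERESIS recomposes to the precomposed ü (U+00FC).
        string ue = "u\u0308".Normalize(NormalizationForm.FormC);
        Console.WriteLine(ue == "\u00FC");      // True

        // A base letter buried under stacked marks has no precomposed
        // equivalent, so most of the marks survive FormC as separate
        // characters and can be detected (then rejected or stripped).
        string zalgo = "e\u0300\u0316\u0352\u035C".Normalize(NormalizationForm.FormC);
        bool strayMarks = zalgo.Any(c =>
            CharUnicodeInfo.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark);
        Console.WriteLine(strayMarks);          // True
    }
}
```

Note that a legitimate prefix can still recompose (e + U+0300 becomes è here), but the excess marks that have no precomposed form are left behind as NonSpacingMark characters, which makes them easy to reject or strip.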