Text run is not in Unicode Normalization Form C

What does it mean?

From W3C:

In Unicode it is possible to produce the same text with different sequences of characters. For example, take the Hungarian word világ. The fourth letter could be stored in memory as a precomposed U+00E1 LATIN SMALL LETTER A WITH ACUTE (a single character) or as a decomposed sequence of U+0061 LATIN SMALL LETTER A followed by U+0301 COMBINING ACUTE ACCENT (two characters).

világ = világ

The Unicode Standard allows either of these alternatives, but requires that both be treated as identical. To improve efficiency, an application will usually normalize text before performing searches or comparisons. Normalization, in this case, means converting the text to use all precomposed or all decomposed characters.

There are four normalization forms specified by the Unicode Standard: NFC, NFD, NFKC and NFKD. The C stands for (pre-)composed, and the D for decomposed. The K stands for compatibility. To improve interoperability, the W3C recommends the use of NFC normalized text on the Web.

Besides "to improve interoperability", precomposed text usually looks better than decomposed text.
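As a small illustration of the quoted explanation, the sketch below builds both spellings of világ in Python and shows that they compare unequal until they are normalized. The variable names are only illustrative, and the ligature example for the K (compatibility) forms is an added assumption, not something taken from the W3C text.

```python
import unicodedata

# Two spellings of the same word: precomposed and decomposed.
precomposed = "vil\u00e1g"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "vila\u0301g"   # U+0061 followed by U+0301 COMBINING ACUTE ACCENT

# As raw code point sequences they differ.
print(precomposed == decomposed)            # False
print(len(precomposed), len(decomposed))    # 5 6

# After normalization to the same form they are identical.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed)                   # True
print(nfd == decomposed)                    # True

# The K forms additionally replace compatibility characters,
# e.g. U+FB01 LATIN SMALL LIGATURE FI becomes the two letters "fi".
ligature = "\ufb01"
nfc_ligature = unicodedata.normalize("NFC", ligature)    # unchanged
nfkc_ligature = unicodedata.normalize("NFKC", ligature)  # "fi"
print(nfc_ligature == ligature, nfkc_ligature)           # True fi
```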

How can I fix this with free tools?

By using the function equivalent to Python's text = unicodedata.normalize('NFC', text) in your favorite programming language.
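A minimal sketch of that approach, assuming the page is available as a Python string. The helper names to_nfc and non_nfc_lines are made up for this example; only unicodedata.normalize comes from the standard library.

```python
import unicodedata


def to_nfc(text):
    """Return text converted to Normalization Form C."""
    return unicodedata.normalize("NFC", text)


def non_nfc_lines(text):
    """Return the 1-based numbers of lines that change under NFC."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if to_nfc(line) != line
    ]


# Hypothetical sample: line 2 uses the decomposed spelling.
sample = "vil\u00e1g\nvila\u0301g\nplain ascii\n"

bad = non_nfc_lines(sample)   # lines a validator would likely flag
fixed = to_nfc(sample)        # normalized copy of the whole text

print(bad)                        # [2]
print(non_nfc_lines(fixed))       # []
```

Reporting the affected lines before rewriting anything makes it easier to check whether the non-NFC text was intentional.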

(Or, if you weren't planning to write a program, your question should be moved to superuser or webmasters.)

A. It means what it says (see dan04’s explanation for a brief answer and the Unicode Standard for a long one). In practice, the message only indicates that the authors of the validator wanted to issue a warning. HTML5 rules do not require Normalization Form C (NFC); it is rather something generally favored by the W3C.

B. There is no need to fix anything, unless you decide that using NFC would actually be better. If you do, then there are various tools for automatic conversion to NFC, such as the free BabelPad editor. If you only need to deal with one character not in NFC, you can use character information repositories such as Fileformat.info character search to find out the canonical decomposition of the character and use it.

Whether you use NFC or not depends on many considerations and on the characters involved. As a rule, NFC works better, but in some cases, an alternative, non-NFC presentation produces more suitable rendering or works better in some specific processing.

For example, in a duplicate question, a character reference for U+2126 (for example &#x2126;) has been reported as triggering the message. (The validator also checks characters entered as such references, rather than only checking the literal text for NFC.) The reference stands for U+2126 OHM SIGN “Ω”, which is defined to be canonically equivalent to U+03A9 GREEK CAPITAL LETTER OMEGA “Ω”. The Unicode Standard explicitly says that the latter is the preferred character. It is also better covered in fonts. But if you have a special reason to use OHM SIGN, you can do that without violating current HTML5 rules, and you can ignore the validator warning.
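The OHM SIGN case can be checked with a short Python sketch. The spelling &#x2126; of the reference is an assumption made for this example (the decimal form &#8486; names the same character); html.unescape and the unicodedata functions are from the standard library.

```python
import html
import unicodedata

# Assumed spelling of the reference; it decodes to U+2126 OHM SIGN.
ohm = html.unescape("&#x2126;")

# NFC replaces OHM SIGN with its canonical equivalent.
omega = unicodedata.normalize("NFC", ohm)

print(unicodedata.name(ohm))            # OHM SIGN
print(unicodedata.name(omega))          # GREEK CAPITAL LETTER OMEGA
print(unicodedata.decomposition(ohm))   # 03A9
print(ohm == omega)                     # False
```

So normalizing to NFC here simply means writing U+03A9 instead of U+2126.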
