Convert between Unicode Normalization Forms on the unix command-line
You can use the uconv utility from ICU. Normalization is achieved through transliteration (-x).
$ uconv -x any-nfd <<<ä | hd
00000000 61 cc 88 0a |a...|
00000004
$ uconv -x any-nfc <<<ä | hd
00000000 c3 a4 0a |...|
00000003
On Debian, Ubuntu and other derivatives, uconv is in the libicu-dev package (newer releases ship the binary in icu-devtools, which libicu-dev depends on). On Fedora, Red Hat and other derivatives, and in BSD ports, it's in the icu package.
Python has the unicodedata module in its standard library, which lets you convert between normalization forms with the unicodedata.normalize() function:
import unicodedata
s1 = 'Spicy Jalape\u00f1o'
s2 = 'Spicy Jalapen\u0303o'
t1 = unicodedata.normalize('NFC', s1)
t2 = unicodedata.normalize('NFC', s2)
print(t1 == t2)
print(ascii(t1))
t3 = unicodedata.normalize('NFD', s1)
t4 = unicodedata.normalize('NFD', s2)
print(t3 == t4)
print(ascii(t3))
Running with Python 3.x:
$ python3 test.py
True
'Spicy Jalape\xf1o'
True
'Spicy Jalapen\u0303o'
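The byte sequences from the uconv example can be cross-checked the same way. This is a minimal sketch for Python 3; utf8_hex is a helper name made up for this illustration, not part of unicodedata:

```python
import unicodedata

def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return " ".join("%02x" % b for b in text.encode("utf-8"))

composed = "\u00e4"  # ä as a single precomposed code point
nfd = unicodedata.normalize("NFD", composed)  # a + combining diaeresis
nfc = unicodedata.normalize("NFC", nfd)       # back to the precomposed form

print(utf8_hex(nfd))  # 61 cc 88
print(utf8_hex(nfc))  # c3 a4
```

These are the same bytes that hd shows above, minus the trailing newline (0a) added by the here-string.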
Python isn't well suited to shell one-liners, but it can be done if you don't want to create a separate script:
$ python3 -c $'import unicodedata\nprint(unicodedata.normalize("NFC", "ääääää"))'
ääääää
For Python 2.x you have to add an encoding line (# -*- coding: utf-8 -*-) and mark strings as Unicode with the u prefix:
$ python -c $'# -*- coding: utf-8 -*-\nimport unicodedata\nprint(unicodedata.normalize("NFC", u"ääääää"))'
ääääää
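If the one-liner grows, a small Python 3 function that normalizes a whole text stream may be easier to maintain. The following is only a sketch: normalize_stream and FORMS are names invented here, and the demo uses in-memory streams instead of real files.

```python
import io
import unicodedata

# Forms accepted by unicodedata.normalize()
FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def normalize_stream(src, dst, form="NFC"):
    """Read all text from src, normalize it, and write it to dst."""
    form = form.upper()
    if form not in FORMS:
        raise ValueError("unknown normalization form: %r" % form)
    dst.write(unicodedata.normalize(form, src.read()))

# Demo on in-memory text streams (stand-ins for stdin/stdout)
out = io.StringIO()
normalize_stream(io.StringIO("Jalapen\u0303o\n"), out, "NFC")
print(ascii(out.getvalue()))  # 'Jalape\xf1o\n'
```

To use it as a filter, a script could call normalize_stream(sys.stdin, sys.stdout, sys.argv[1]); that assumes a UTF-8 locale so the standard streams are decoded correctly.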
For completeness, with perl (-CSA makes perl treat the standard streams and the command-line arguments as UTF-8):
$ perl -CSA -MUnicode::Normalize=NFD -e 'print NFD($_) for @ARGV' $'\ue1' | uconv -x name
\N{LATIN SMALL LETTER A}\N{COMBINING ACUTE ACCENT}
$ perl -CSA -MUnicode::Normalize=NFC -e 'print NFC($_) for @ARGV' $'a\u301' | uconv -x name
\N{LATIN SMALL LETTER A WITH ACUTE}
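When uconv isn't installed, the code point names can also be inspected from Python 3. A rough sketch, assuming a list of names is enough (it does not reproduce the \N{...} output format of uconv -x name); code_point_names is a hypothetical helper:

```python
import unicodedata

def code_point_names(text):
    """Return the Unicode name of each code point, or U+XXXX if it has none."""
    return [unicodedata.name(ch, "U+%04X" % ord(ch)) for ch in text]

print(code_point_names(unicodedata.normalize("NFD", "\u00e1")))
# ['LATIN SMALL LETTER A', 'COMBINING ACUTE ACCENT']
print(code_point_names(unicodedata.normalize("NFC", "a\u0301")))
# ['LATIN SMALL LETTER A WITH ACUTE']
```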