Converting a UTF-8 file to ASCII (best-effort)

This will work for some things:

iconv -f utf-8 -t ascii//TRANSLIT

echo ĥéĺłœ π | iconv -f utf-8 -t ascii//TRANSLIT returns helloe ?. Any characters that iconv doesn’t know how to convert will be replaced with question marks.

iconv is specified by POSIX, but the //TRANSLIT suffix is not, so I don’t know whether all systems support it. It works for me on Linux. Alternatively, the //IGNORE suffix silently discards characters that cannot be represented in the target character set (see man iconv_open).
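Since //TRANSLIT may not be available everywhere, one possible approach is a small wrapper that tries it first and falls back to a plain byte-level tr filter when iconv exits with an error. This is only a sketch: to_ascii is a made-up helper name, mktemp is assumed to be installed (it is common but not required by POSIX), and the exact output of the iconv branch depends on the platform and locale.

```shell
#!/bin/sh
# to_ascii is a hypothetical helper name, not an existing tool.
# Reads UTF-8 on stdin and writes best-effort ASCII on stdout.
to_ascii() {
    # Assumption: mktemp is available on this system.
    src=$(mktemp) || return 1
    dst=$(mktemp) || { rm -f "$src"; return 1; }
    cat > "$src"

    if iconv -f utf-8 -t ascii//TRANSLIT < "$src" > "$dst" 2>/dev/null; then
        # iconv exited successfully, so use its output.
        cat "$dst"
    else
        # Fallback: drop UTF-8 continuation bytes, then turn each
        # remaining non-ASCII lead byte into "?".
        LC_ALL=C tr -d '\200-\277' < "$src" | LC_ALL=C tr '\300-\377' '[?*]'
    fi

    rm -f "$src" "$dst"
}

# "café" with a precomposed é (bytes C3 A9).
printf 'caf\303\251\n' | to_ascii
```

The input is buffered in a temporary file so that a failed iconv run cannot leave partial output on stdout before the fallback runs.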

An inferior but POSIX-compliant option is to use tr. The command below replaces every non-ASCII code point with a question mark. It processes UTF-8 text one byte at a time, so “É” might be replaced with E? or ?, depending on whether it was encoded with a combining accent or as a precomposed character.

echo café äëïöü | tr -d '\200-\277' | tr '\300-\377' '[?*]'

Given precomposed characters as input, that example returns caf? ?????.
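The difference between precomposed and combining input can be checked directly by feeding the filter explicit byte sequences. In this sketch, ascii_filter is a made-up wrapper around the same tr pipeline, and LC_ALL=C is my addition so that tr treats its input as raw bytes regardless of the user's locale.

```shell
#!/bin/sh
# ascii_filter is a hypothetical helper name wrapping the tr pipeline above.
# LC_ALL=C is an added assumption: it forces byte-wise processing.
ascii_filter() {
    LC_ALL=C tr -d '\200-\277' | LC_ALL=C tr '\300-\377' '[?*]'
}

# Precomposed "É" (U+00C9): bytes C3 89.
precomposed=$(printf '\303\211' | ascii_filter)

# "E" followed by a combining acute accent (U+0301): bytes 45 CC 81.
combining=$(printf 'E\314\201' | ascii_filter)

printf 'precomposed: %s\n' "$precomposed"   # prints "precomposed: ?"
printf 'combining:   %s\n' "$combining"     # prints "combining:   E?"
```

In both cases the continuation byte is deleted and the lead byte becomes a question mark; the combining form keeps its plain "E" because that byte is already ASCII.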

konwert utf8-ascii

konwert does a best-effort conversion, depending on its conversion tables. If you know the approximate input language, there are language-specific filters that give better results, e.g.

konwert utf8-xmetodo

is the conversion of Esperanto into the x-metodo representation,

konwert UTF8-tex

will try to produce a TeX representation of diacritics. There are also language-specific parameters:

konwert UTF8-ascii/de

will transliterate "ä" into "ae" (customary for German) instead of plain "a".

konwert UTF8-ascii/rosyjski

will use Polish rules for transliterating Russian instead of the "English-like" ones, and so on.

Try uni2ascii:

uni2ascii -B input.txt > output.txt
