Converting a UTF-8 file to ASCII (best-effort)
This will work for some things:
iconv -f utf-8 -t ascii//TRANSLIT
echo ĥéĺłœ π | iconv -f utf-8 -t ascii//TRANSLIT
returns "helloe ?". Any characters that iconv doesn’t know how to convert are replaced with question marks.
iconv is POSIX, but the TRANSLIT option is an extension and I don’t know if all systems have it. It works for me on Linux. Also, the IGNORE option will silently discard characters that cannot be represented in the target character set (see man iconv_open).
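For an actual file rather than a pipe, the same two variants can be wrapped roughly as follows. This is only a sketch: the function names and the file names input.txt and output.txt are made up for illustration, and both //TRANSLIT and //IGNORE are assumed to be supported by the local iconv. As far as I can tell, glibc's transliteration also depends on the current locale, so an accented letter may come out as a plain letter in one environment and as ? in another.

```shell
# Hypothetical helper names; assumes an iconv that understands
# the //TRANSLIT and //IGNORE suffixes (not guaranteed by POSIX).

# Best-effort: approximate non-ASCII characters where iconv knows how.
utf8_to_ascii_translit() {
    iconv -f utf-8 -t ascii//TRANSLIT "$@"
}

# Lossy: drop anything that has no ASCII equivalent.
# The exit status may be non-zero even when output was written.
utf8_to_ascii_ignore() {
    iconv -f utf-8 -t ascii//IGNORE "$@"
}

# Usage (input.txt and output.txt are placeholder names):
#   utf8_to_ascii_translit input.txt > output.txt
#   utf8_to_ascii_ignore   input.txt > output.txt
```

With no file argument both functions read standard input, so they can also sit at the end of a pipe like the echo example above.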
An inferior but POSIX-compliant option is to use tr. The command below reads UTF-8 text one byte at a time: it deletes continuation bytes and replaces each lead byte with a question mark, so every non-ASCII code point becomes a single ?. “É” might be replaced with E? or ?, depending on whether it was encoded using a combining accent or a precomposed character.
echo café äëïöü | tr -d '\200-\277' | tr '\300-\377' '[?*]'
That example returns "caf? ?????", using precomposed characters.
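To see the difference between the two encodings of an accented letter, the bytes can be written out explicitly with printf. A rough sketch, assuming a tr that works byte by byte (LC_ALL=C is set to push it that way); the function name ascii_only is made up here:

```shell
# Hypothetical wrapper around the tr pipeline above.
ascii_only() {
    # Delete UTF-8 continuation bytes (0x80-0xBF), then turn every
    # remaining high byte (a lead byte, 0xC0-0xFF) into '?'.
    LC_ALL=C tr -d '\200-\277' | LC_ALL=C tr '\300-\377' '[?*]'
}

# Precomposed: U+00C9 is the two bytes C3 89, which collapse to one '?'.
printf '\303\211\n' | ascii_only

# Decomposed: 'E' followed by U+0301 (bytes CC 81); the 'E' survives.
printf 'E\314\201\n' | ascii_only
```

The first call should print ? and the second E?, which is the E? versus ? difference described above.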
konwert utf8-ascii
It will do a best-effort conversion, depending on its conversion tables. If you know roughly what language the input is in, there are language-specific filters that give better results, e.g.
konwert utf8-xmetodo
converts Esperanto into the x-metodo representation, and
konwert UTF8-tex
tries to produce the TeX representation of diacritics. There are also language-specific parameters:
konwert UTF8-ascii/de
will transliterate "ä" into "ae" (customary for German) instead of plain "a", and
konwert UTF8-ascii/rosyjski
will use Polish rules for transliterating Russian instead of the "English-like" ones, and so on.
Try uni2ascii:
uni2ascii -B input.txt > output.txt