How to remove all of the diacritics from a file?

If you check the man page of the tool iconv:

//TRANSLIT
When the string "//TRANSLIT" is appended to --to-code, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.

So we could do:

kent$  cat test1
    Replace ā, á, ǎ, and à with a.
    Replace ē, é, ě, and è with e.
    Replace ī, í, ǐ, and ì with i.
    Replace ō, ó, ǒ, and ò with o.
    Replace ū, ú, ǔ, and ù with u.
    Replace ǖ, ǘ, ǚ, and ǜ with ü.
    Replace Ā, Á, Ǎ, and À with A.
    Replace Ē, É, Ě, and È with E.
    Replace Ī, Í, Ǐ, and Ì with I.
    Replace Ō, Ó, Ǒ, and Ò with O.
    Replace Ū, Ú, Ǔ, and Ù with U.
    Replace Ǖ, Ǘ, Ǚ, and Ǜ with U.


kent$  iconv -f utf8 -t ascii//TRANSLIT test1
    Replace a, a, a, and a with a.
    Replace e, e, e, and e with e.
    Replace i, i, i, and i with i.
    Replace o, o, o, and o with o.
    Replace u, u, u, and u with u.
    Replace u, u, u, and u with u.
    Replace A, A, A, and A with A.
    Replace E, E, E, and E with E.
    Replace I, I, I, and I with I.
    Replace O, O, O, and O with O.
    Replace U, U, U, and U with U.
    Replace U, U, U, and U with U.

This might work for you:

sed -i 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/' file
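
If sed rejects the y command, complaining that the two strings have different lengths, the shell is probably not running in a UTF-8 locale, so sed is counting bytes instead of characters. Here is a minimal sketch that sidesteps this by matching each accented letter as a literal byte sequence. The function name strip_tone_marks is made up for illustration, it only covers the letters listed in the question, and the script itself has to be saved as UTF-8:

```shell
# Hypothetical helper: reads text on stdin and writes it to stdout with the
# tone-marked vowels from the question replaced by their base letters.
# LC_ALL=C makes sed work on bytes, so every alternative is matched as a
# literal UTF-8 byte sequence regardless of the caller's locale.
strip_tone_marks() {
    LC_ALL=C sed -E \
        -e 's/(ā|á|ǎ|à)/a/g' \
        -e 's/(ē|é|ě|è)/e/g' \
        -e 's/(ī|í|ǐ|ì)/i/g' \
        -e 's/(ō|ó|ǒ|ò)/o/g' \
        -e 's/(ū|ú|ǔ|ù)/u/g' \
        -e 's/(ǖ|ǘ|ǚ|ǜ)/ü/g' \
        -e 's/(Ā|Á|Ǎ|À)/A/g' \
        -e 's/(Ē|É|Ě|È)/E/g' \
        -e 's/(Ī|Í|Ǐ|Ì)/I/g' \
        -e 's/(Ō|Ó|Ǒ|Ò)/O/g' \
        -e 's/(Ū|Ú|Ǔ|Ù)/U/g' \
        -e 's/(Ǖ|Ǘ|Ǚ|Ǜ)/Ü/g'
}
```

Used as a filter, for example strip_tone_marks < file > file.new, it turns "nǐ hǎo" into "ni hao". Like the y command above, it maps ǖ, ǘ, ǚ, ǜ to ü rather than to a plain u.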

I like iconv because it handles all accent variations:

iconv -f utf8 -t ascii//TRANSLIT//IGNORE non-ascii.txt > ascii.txt
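
Transliteration results can differ between iconv implementations, so it may be worth checking that nothing non-ASCII slipped through. Here is a rough sketch that counts the bytes outside the 7-bit ASCII range (count_non_ascii is a name invented here, not a standard tool):

```shell
# Hypothetical helper: prints how many bytes on stdin fall outside 0-127.
# tr deletes every ASCII byte (octal 000-177); wc -c counts what is left.
# The final tr strips the padding some wc implementations add.
count_non_ascii() {
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}
```

Running count_non_ascii < ascii.txt should print 0 if the conversion produced plain ASCII. Each leftover accented letter shows up as two or more bytes, because the count is in bytes, not characters.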

This is what the tr(1) command is for. For example:

tr 'āáǎàēéěèīíǐì...' 'aaaaeeeeiii...' <infile >outfile

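You may have to check or change your LANG environment variable to match the character set being used. Note that GNU tr works on bytes rather than characters, so it does not handle multibyte UTF-8 input correctly; for UTF-8 files, iconv or sed is the safer choice.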
