How to remove all of the diacritics from a file?
If you check the man page of the tool iconv:
//TRANSLIT
When the string "//TRANSLIT" is appended to --to-code, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking characters.
So we could do:
kent$ cat test1
Replace ā, á, ǎ, and à with a.
Replace ē, é, ě, and è with e.
Replace ī, í, ǐ, and ì with i.
Replace ō, ó, ǒ, and ò with o.
Replace ū, ú, ǔ, and ù with u.
Replace ǖ, ǘ, ǚ, and ǜ with ü.
Replace Ā, Á, Ǎ, and À with A.
Replace Ē, É, Ě, and È with E.
Replace Ī, Í, Ǐ, and Ì with I.
Replace Ō, Ó, Ǒ, and Ò with O.
Replace Ū, Ú, Ǔ, and Ù with U.
Replace Ǖ, Ǘ, Ǚ, and Ǜ with U.
kent$ iconv -f utf8 -t ascii//TRANSLIT test1
Replace a, a, a, and a with a.
Replace e, e, e, and e with e.
Replace i, i, i, and i with i.
Replace o, o, o, and o with o.
Replace u, u, u, and u with u.
Replace u, u, u, and u with u.
Replace A, A, A, and A with A.
Replace E, E, E, and E with E.
Replace I, I, I, and I with I.
Replace O, O, O, and O with O.
Replace U, U, U, and U with U.
Replace U, U, U, and U with U.
This might work for you:
sed -i 'y/āáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜĀÁǍÀĒÉĚÈĪÍǏÌŌÓǑÒŪÚǓÙǕǗǙǛ/aaaaeeeeiiiioooouuuuüüüüAAAAEEEEIIIIOOOOUUUUÜÜÜÜ/' file
I like iconv, as it handles all accent variations:
cat non-ascii.txt | iconv -f utf8 -t ascii//TRANSLIT//IGNORE > ascii.txt
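One way to check the result of a conversion like this is to count the bytes that fall outside the 7-bit ASCII range. The following is only a sketch: the helper name count_non_ascii is made up for illustration, and it assumes a POSIX shell with the standard tr and wc utilities.

```shell
#!/bin/sh
# count_non_ascii FILE
# Print the number of bytes in FILE that are outside 7-bit ASCII.
# (Hypothetical helper name; not part of iconv or any standard tool.)
count_non_ascii() {
    # LC_ALL=C makes tr treat its input as plain bytes.
    # Deleting the range \000-\177 leaves only bytes >= 0x80,
    # which wc -c then counts. The final tr strips the padding
    # that some wc implementations put in front of the number.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

Running count_non_ascii ascii.txt after the command above should print 0 if nothing non-ASCII is left in the file. A non-zero count means some bytes were neither transliterated nor dropped.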
This is what the tr(1) command is for. For example:
tr 'āáǎàēéěèīíǐì...' 'aaaaeeeeiii...' <infile >outfile
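You may have to check or change your LANG environment variable to match the character set being used. Be aware that some tr implementations, GNU tr among them, work on bytes rather than characters, so with a multibyte encoding such as UTF-8 they will not map these characters correctly.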