How to replace unicode characters by ascii characters in Python (perl script given)?

For converting to ASCII you might want to try ASCII, Dammit or this recipe, which boils down to:

>>> title = u"Klüft skräms inför på fédéral électoral große"
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'

  • Use the fileinput module to loop over standard input or a list of files,
  • decode the lines you read from UTF-8 to unicode objects
  • then map any unicode characters you desire with the translate method

translit.py would look like this:

#!/usr/bin/env python2.6
# -*- coding: utf-8 -*-

import fileinput

table = {
          0xe4: u'ae',
          ord(u'ö'): u'oe',
          ord(u'ü'): u'ue',
          ord(u'ß'): None,
        }

for line in fileinput.input():
    s = line.decode('utf8')
    print s.translate(table), 

And you could use it like this:

$ cat utf8.txt 
sömé täßt
sömé täßt
sömé täßt

$ ./translit.py utf8.txt 
soemé taet
soemé taet
soemé taet
  • Update:

In case you are using python 3 strings are by default unicode and you dont' need to encode it if it contains non-ASCII characters or even a non-Latin characters. So the solution will look as follow:

line = 'Verhältnismäßigkeit, Möglichkeit'

table = {
         ord('ä'): 'ae',
         ord('ö'): 'oe',
         ord('ü'): 'ue',
         ord('ß'): 'ss',
       }

line.translate(table)

>>> 'Verhaeltnismaessigkeit, Moeglichkeit'

You could try unidecode to convert Unicode into ascii instead of writing manual regular expressions. It is a Python port of Text::Unidecode Perl module:

#!/usr/bin/env python
import fileinput
import locale
from contextlib import closing
from unidecode import unidecode # $ pip install unidecode

def toascii(files=None, encoding=None, bufsize=-1):
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    with closing(fileinput.FileInput(files=files, bufsize=bufsize)) as file:
        for line in file: 
            print unidecode(line.decode(encoding)),

if __name__ == "__main__":
    import sys
    toascii(encoding=sys.argv.pop(1) if len(sys.argv) > 1 else None)

It uses FileInput class to avoid global state.

Example:

$ echo 'äöüß' | python toascii.py utf-8
aouss