How to replace unicode characters by ascii characters in Python (perl script given)?
For converting to ASCII you might want to try ASCII, Dammit or this recipe, which boils down to:
>>> title = u"Klüft skräms inför på fédéral électoral große"
>>> import unicodedata
>>> unicodedata.normalize('NFKD', title).encode('ascii','ignore')
'Kluft skrams infor pa federal electoral groe'
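The session above is Python 2. A minimal Python 3 sketch of the same recipe follows; the helper name `strip_to_ascii` is made up for illustration. The only real difference is that `encode` returns `bytes`, so it is decoded back to `str` at the end.

```python
import unicodedata


def strip_to_ascii(text):
    """Drop accents and any other non-ASCII characters from text."""
    # NFKD splits an accented letter into its base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # 'ignore' discards whatever is still non-ASCII: the combining marks,
    # and letters such as ß that have no decomposition at all.
    return decomposed.encode("ascii", "ignore").decode("ascii")


title = "Klüft skräms inför på fédéral électoral große"
print(strip_to_ascii(title))  # Kluft skrams infor pa federal electoral groe
```

Note that ß is silently lost ("große" becomes "groe"), which is why an explicit mapping table or a transliteration library may be preferable for German text.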
- Use the fileinput module to loop over standard input or a list of files,
- decode the lines you read from UTF-8 to unicode objects,
- then map any unicode characters you desire with the translate method.
translit.py would look like this:
#!/usr/bin/env python2.6
# -*- coding: utf-8 -*-
import fileinput

table = {
    0xe4: u'ae',        # ä
    ord(u'ö'): u'oe',
    ord(u'ü'): u'ue',
    ord(u'ß'): None,    # None deletes the character
}

for line in fileinput.input():
    s = line.decode('utf8')
    print s.translate(table),
And you could use it like this:
$ cat utf8.txt
sömé täßt
sömé täßt
sömé täßt
$ ./translit.py utf8.txt
soemé taet
soemé taet
soemé taet
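For reference, here is a rough Python 3 sketch of the same file loop. The function name `translit_files` and its parameters are assumptions for illustration, not part of the original script. `fileinput.hook_encoded` lets `fileinput` open each named file as UTF-8 text, which replaces the manual `decode` call.

```python
import fileinput
import sys

# Same mapping as translit.py; a value of None deletes the character.
TABLE = {
    ord("ä"): "ae",
    ord("ö"): "oe",
    ord("ü"): "ue",
    ord("ß"): None,
}


def translit_files(paths, out=sys.stdout, table=TABLE):
    """Write every line of the UTF-8 files in paths to out, transliterated."""
    # hook_encoded opens each named file in text mode with the given encoding,
    # so the lines arrive as str and can be translated directly.
    hook = fileinput.hook_encoded("utf-8")
    with fileinput.input(files=paths, openhook=hook) as lines:
        for line in lines:
            # Lines keep their trailing newline, so write() is enough.
            out.write(line.translate(table))
```

Calling `translit_files(["utf8.txt"])` should print the same three `soemé taet` lines as above. An empty `paths` makes `fileinput` fall back to standard input, where the open hook may not be applied, so this sketch is meant for named files.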
Update:
If you are using Python 3, strings are Unicode by default, so you don't need to decode them, even if they contain non-ASCII or non-Latin characters. The solution then looks as follows:
line = 'Verhältnismäßigkeit, Möglichkeit'
table = {
    ord('ä'): 'ae',
    ord('ö'): 'oe',
    ord('ü'): 'ue',
    ord('ß'): 'ss',
}
line.translate(table)
# 'Verhaeltnismaessigkeit, Moeglichkeit'
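The table only covers the characters listed in it, so other accented letters (the é in the earlier sample, for instance) pass through unchanged. One possible way to get pure ASCII is to apply the German table first and then fall back to the NFKD recipe for whatever is left. The sketch below assumes Python 3; the name `to_ascii` and the uppercase entries are additions for illustration.

```python
import unicodedata

# German-specific replacements; the uppercase entries are an assumption
# added here, the table above only lists the lowercase letters.
GERMAN = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def to_ascii(text):
    """Transliterate German umlauts, then strip all remaining non-ASCII."""
    # Compose first, so that 'a' + combining diaeresis becomes a single 'ä'
    # that the table can match.
    text = unicodedata.normalize("NFC", text)
    # The table must run before the fallback, which would turn ä into plain a.
    text = text.translate(GERMAN)
    # Fallback: decompose what is left and drop every non-ASCII code point.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("sömé täßt"))  # soeme taesst
```

The order of the two steps matters: running the NFKD step first would reduce ä to a before the table ever sees it.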
You could try unidecode to convert Unicode into ASCII instead of writing regular expressions by hand. It is a Python port of the Text::Unidecode Perl module:
#!/usr/bin/env python
import fileinput
import locale
from contextlib import closing

from unidecode import unidecode  # $ pip install unidecode


def toascii(files=None, encoding=None, bufsize=-1):
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    with closing(fileinput.FileInput(files=files, bufsize=bufsize)) as file:
        for line in file:
            print unidecode(line.decode(encoding)),


if __name__ == "__main__":
    import sys
    toascii(encoding=sys.argv.pop(1) if len(sys.argv) > 1 else None)
It uses the FileInput class to avoid global state.
Example:
$ echo 'äöüß' | python toascii.py utf-8
aouss