How to convert unicode accented characters to pure ascii without accents?
I needed something like this, but to remove only accents while leaving other special characters untouched, so I wrote this small function:
# -*- coding: utf-8 -*-
import re

def remove_accents(string):
    # Decode byte strings first so the regexes match whole characters
    if not isinstance(string, unicode):
        string = unicode(string, encoding='utf-8')
    string = re.sub(u"[àáâãäå]", 'a', string)
    string = re.sub(u"[èéêë]", 'e', string)
    string = re.sub(u"[ìíîï]", 'i', string)
    string = re.sub(u"[òóôõö]", 'o', string)
    string = re.sub(u"[ùúûü]", 'u', string)
    string = re.sub(u"[ýÿ]", 'y', string)
    return string
I like this function because you can customize it: any character you do not list in a pattern is left unchanged.
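The same idea can be written with a lookup table instead of repeated regex passes. The following is only a sketch, assuming Python 3 (where every str is already Unicode); the name ACCENT_GROUPS and the groups parameter are illustrative and not part of the function above.

```python
# Hypothetical mapping: base letter -> accented forms to replace.
# Edit this table to decide which characters get converted.
ACCENT_GROUPS = {
    'a': 'àáâãäå',
    'e': 'èéêë',
    'i': 'ìíîï',
    'o': 'òóôõö',
    'u': 'ùúûü',
    'y': 'ýÿ',
}

def remove_accents(text, groups=ACCENT_GROUPS):
    """Replace each listed accented letter with its base letter."""
    # str.translate expects a dict keyed by code point (int).
    table = {ord(ch): base
             for base, accented in groups.items()
             for ch in accented}
    return text.translate(table)

print(remove_accents('café à la crème'))
```

As with the regex version, only lowercase letters are listed, so uppercase accented letters and characters such as ñ pass through untouched unless you add them to the table.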
How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a?
Assuming you have loaded your Unicode text into a variable called my_unicode, normalizing à into a is this simple:
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
Explicit example...
>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
>>>
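How it works
unicodedata.normalize('NFD', "insert-unicode-text-here") performs a Canonical Decomposition (NFD) of the unicode text, splitting each accented character into its base letter followed by a combining accent mark. Then str.encode('ascii', 'ignore') converts the result to ASCII, silently dropping the combining marks (and any other character that has no ASCII equivalent).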
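To make the two steps visible, here is a small sketch, assuming Python 3, where encode() returns bytes and so a final decode('ascii') is needed to get a str back. The helper name strip_accents is my own, not something from the standard library.

```python
import unicodedata

def strip_accents(text):
    # Step 1: NFD splits each accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize('NFD', text)
    # Step 2: combining marks are not ASCII, so 'ignore' drops them.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# Inspect what NFD produces for a single accented character.
parts = [unicodedata.name(ch) for ch in unicodedata.normalize('NFD', 'à')]
print(parts)                 # ['LATIN SMALL LETTER A', 'COMBINING GRAVE ACCENT']
print(strip_accents('àà'))   # aa
```

Note that 'ignore' discards every non-ASCII character left after decomposition, not just accents, so letters without a decomposition disappear from the output.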
@Mike Pennington's solution works great, thanks to him. But when I tried it, I noticed that it fails for some special characters (e.g. the ı character from the Turkish alphabet) that have no NFD decomposition, so they are simply dropped from the output.
I found another solution: you can use the unidecode library for this conversion.
>>> import unidecode
>>> example = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZabcçdefgğhıijklmnoöprsştuüvyz"
>>> # decode the UTF-8 byte string into a unicode object
>>> utf8text = unicode(example, "utf-8")
>>> print utf8text
ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZabcçdefgğhıijklmnoöprsştuüvyz
>>> # transliterate the unicode text to plain ASCII
>>> asciitext = unidecode.unidecode(utf8text)
>>> print asciitext
ABCCDEFGGHIIJKLMNOOPRSSTUUVYZabccdefgghiijklmnooprsstuuvyz
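If you cannot install a third-party package, the NFD approach can be patched for characters such as ı by giving it a small fallback table. This is only a sketch, assuming Python 3; FALLBACK and to_ascii are hypothetical names, and the table covers just the entries you add to it, unlike unidecode's much larger transliteration tables.

```python
import unicodedata

# Hypothetical fallback table for letters that have no canonical decomposition.
FALLBACK = {'ı': 'i', 'ø': 'o'}

def to_ascii(text, fallback=FALLBACK):
    out = []
    for ch in unicodedata.normalize('NFD', text):
        if ord(ch) < 128:
            out.append(ch)             # already ASCII
        elif ch in fallback:
            out.append(fallback[ch])   # explicit replacement
        # anything else (combining marks, unmapped letters) is dropped
    return ''.join(out)

# ı has no decomposition, which is why plain NFD + 'ignore' loses it.
print(repr(unicodedata.decomposition('ı')))
print(to_ascii('ılık İstanbul'))
```

The first print shows an empty string, confirming that NFD leaves ı as a single non-ASCII character; İ, by contrast, decomposes into I plus a combining dot and needs no table entry.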