Replacing Cyrillic symbols in an attribute table using Python?
Remember that, starting with 9.3, all of ArcGIS's Python string handling uses Unicode objects, which will make your life quite a bit easier because encoding becomes less of a big deal in the data. You'll still need to think about it in your scripts, but if you use UTF-8 for any Python source you write outside of the dialogs in Arc*.exe, ArcGIS will handle it fine: when it reads scripts from disk, it assumes they're in UTF-8.
Now, opening up charmap, we find the Cyrillic characters nestled comfortably between Coptic and Armenian. The first character is capital Io (Ё), which charmap tells us is U+0401. In Python, this can be represented in a unicode literal as u'\u0401'. The last character is small Yeru with diaeresis (ӹ), U+04F9 (u'\u04f9'). So we want to filter out all the characters in the range 0x0401-0x04f9.
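As a quick sanity check of those two code points, here is a minimal sketch using only the standard library. The variable names `first` and `last` are just for illustration.

```python
import unicodedata

first = u'\u0401'  # capital Io
last = u'\u04f9'   # small Yeru with diaeresis

# ord() returns the code point of a single character
print(hex(ord(first)))
print(hex(ord(last)))

# unicodedata.name() reports the official Unicode name of a character
print(unicodedata.name(first))
print(unicodedata.name(last))
```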
This is one of the very few places where a regular expression really is the right tool for the job. To represent the range, we use the expression [\u0401-\u04f9]. From there, we can use re.sub to replace every match with nothing. This is how we'd do that:
my_new_attribute_string = re.sub(u'[\u0401-\u04f9]', '', my_old_attribute_string)
And this should do it.
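To try this outside ArcGIS, a self-contained sketch might look like the following. The sample string is made up for illustration.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample value mixing Latin and Cyrillic text
my_old_attribute_string = u'Moscow (Москва)'

# Remove every character in the range U+0401..U+04F9
my_new_attribute_string = re.sub(u'[\u0401-\u04f9]', u'', my_old_attribute_string)

print(repr(my_new_attribute_string))
```

Note that the Unicode Cyrillic block as a whole spans U+0400 to U+04FF, so if your data might contain the few characters outside the range used here, a wider class such as [\u0400-\u04ff] may be worth considering.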
You can also use re.compile if you're going to reuse the same expression over and over, which saves some time. For example, a script that removes all Cyrillic from a specific column in a table:
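import re
import arcpy

# Compile once; matches any character in U+0401..U+04F9
cyrillic_substitution = re.compile(u'[\u0401-\u04f9]')

# my_feature_class and column_with_cyrillic are placeholders for your own data
with arcpy.da.UpdateCursor(my_feature_class, column_with_cyrillic) as cur:
    for row in cur:
        # Strip the Cyrillic characters and write the row back
        row[0] = cyrillic_substitution.sub("", row[0])
        cur.updateRow(row)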
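One thing to watch for: null values in a text field typically come back from a cursor as None, and passing None to sub() raises a TypeError. A small wrapper can guard against that. The sketch below runs without arcpy; `strip_cyrillic` is a hypothetical helper name, and the `rows` list is a stand-in for the single-field rows a cursor would yield.

```python
# -*- coding: utf-8 -*-
import re

cyrillic_substitution = re.compile(u'[\u0401-\u04f9]')

def strip_cyrillic(value):
    """Return value without Cyrillic characters; pass None through unchanged."""
    if value is None:
        return None
    return cyrillic_substitution.sub(u'', value)

# Stand-in rows (assumed data), shaped like single-field cursor rows
rows = [[u'Road Дорога 5'], [None], [u'Main St']]
cleaned = [[strip_cyrillic(row[0])] for row in rows]
print(cleaned)
```

Inside the cursor loop, the assignment would then read row[0] = strip_cyrillic(row[0]).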
There is a good answer on SO that translates Cyrillic to Latin. Here is the logic:
symbols = (u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
           u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")
# Map each Cyrillic code point to the code point of its Latin counterpart
tr = {ord(a): ord(b) for a, b in zip(*symbols)}

def cyrillic2latin(text):
    return text.translate(tr)
E.g. cyrillic2latin(u'Москва') returns u'Moskva'.
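The table above has no entries for letters such as ж, ц, ч, ш, щ, ю and я, which usually need more than one Latin letter, so translate() leaves them untouched. One possible way to cover them is to map code points to strings instead of to single code points. This is only a sketch: the `multi` dictionary uses one of several common transliteration conventions (an assumption, not part of the original answer), and `cyrillic2latin_ext` is a hypothetical name.

```python
# -*- coding: utf-8 -*-
symbols = (u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
           u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")

# translate() accepts strings as mapping values, not only code points
tr = {ord(a): b for a, b in zip(*symbols)}

# Assumed multi-letter transliterations; adjust to the convention you need
multi = {u'ж': u'zh', u'ц': u'ts', u'ч': u'ch', u'ш': u'sh',
         u'щ': u'shch', u'ю': u'yu', u'я': u'ya'}
for cyr, lat in multi.items():
    tr[ord(cyr)] = lat
    tr[ord(cyr.upper())] = lat.capitalize()

def cyrillic2latin_ext(text):
    return text.translate(tr)

print(cyrillic2latin_ext(u'Чита'))
```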