Replacing Cyrillic symbols in attribute table using Python?

Remember that, starting in 9.3, all of ArcGIS' Python stuff dealing with strings uses Unicode objects, which will make your life quite a bit easier because encoding becomes much less of a concern in the data itself. You'll still need to think about it in your scripts, but if you save any Python source you write outside of the dialogs in Arc*.exe as UTF-8, ArcGIS will handle it fine: when it reads scripts from disk, it assumes they're in UTF-8.

Now, opening up charmap, we find the Cyrillic characters nestled comfortably between Coptic and Armenian. The first character it shows is capital Io (Ё), which charmap tells us is U+0401. In Python, this can be represented in a unicode literal as u'\u0401'. The last character is small Yeru with diaeresis (ӹ), U+04F9 (u'\u04f9'). So we want to filter out all the characters in the range U+0401 to U+04F9. (The Unicode Cyrillic block as a whole runs from U+0400 to U+04FF, so widen the range if your data uses the few characters at either end.)
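
If you don't have charmap handy, the same lookup can be done from Python itself. This is just a small sketch using the standard unicodedata module; the variable names are mine, not part of any ArcGIS API:

```python
# -*- coding: utf-8 -*-
import unicodedata

# The two endpoints of the range discussed above
first, last = u"\u0401", u"\u04f9"

# unicodedata.name returns the official Unicode character name
first_name = unicodedata.name(first)
last_name = unicodedata.name(last)

print(first_name)  # CYRILLIC CAPITAL LETTER IO
print(last_name)   # CYRILLIC SMALL LETTER YERU WITH DIAERESIS
```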

This is one of the very few places where a regular expression really is the right tool for the job. To represent the range, we use the expression [\u0401-\u04f9]. From there, we can use re.sub to replace it with nothing. This is how we'd do that:

import re
my_new_attribute_string = re.sub(u'[\u0401-\u04f9]', u'', my_old_attribute_string)

And this should do it.
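
To convince yourself before touching real data, you can try the expression on a throwaway string outside ArcGIS. A minimal sketch, where the sample value is made up for illustration:

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical attribute value mixing Latin and Cyrillic text
sample = u"Road Дорога 12"

# Every character in U+0401..U+04F9 is replaced with an empty string
cleaned = re.sub(u"[\u0401-\u04f9]", u"", sample)

print(repr(cleaned))
```

Note that only the Cyrillic letters go away. The spaces on either side of the removed word stay, so you end up with two spaces in a row and may want to tidy those up separately.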

If you're going to reuse the same expression over and over, you can also use re.compile to save some time. For example, here is a script that removes all Cyrillic from a specific column in a table:

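import re

import arcpy

# Compile once; matches any single character in U+0401..U+04F9
cyrillic_substitution = re.compile(u'[\u0401-\u04f9]')

# my_feature_class and column_with_cyrillic must be defined beforehand.
# arcpy.da cursors require ArcGIS 10.1 or later.
with arcpy.da.UpdateCursor(my_feature_class, column_with_cyrillic) as cur:
    for row in cur:
        if row[0] is not None:  # NULL fields come back as None
            row[0] = cyrillic_substitution.sub(u"", row[0])
            cur.updateRow(row)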

There is also a good answer on Stack Overflow that transliterates Cyrillic to Latin instead of deleting it. Here is the logic:

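# Parallel strings: each Cyrillic character maps to the Latin one at the same position
symbols = (u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
           u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")

# translate() expects a dict keyed by code point
tr = {ord(a): ord(b) for a, b in zip(*symbols)}

def cyrillic2latin(text):
    # Letters missing from the table (ж, ц, ч, ш, щ, ю, я) pass through unchanged
    return text.translate(tr)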

E.g. cyrillic2latin(u'Москва') returns u'Moskva'.
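
The table only covers letters that map to a single Latin character, so ж, ц, ч, ш, щ, ю and я are left as they are. Since translate() also accepts strings as replacement values, one way to cover them is sketched below. The multi-letter spellings (zh, ts, ch, and so on) and the names tr_extended and cyrillic2latin_extended are my own assumptions, not part of the original answer, so swap in whichever romanization scheme you actually need:

```python
# -*- coding: utf-8 -*-
# Single-character pairs, same table as in the original answer
symbols = (u"абвгдеёзийклмнопрстуфхъыьэАБВГДЕЁЗИЙКЛМНОПРСТУФХЪЫЬЭ",
           u"abvgdeezijklmnoprstufh'y'eABVGDEEZIJKLMNOPRSTUFH'Y'E")

# Map code point -> replacement string (translate accepts strings as values)
tr_extended = {ord(a): b for a, b in zip(*symbols)}

# Assumed multi-character romanizations; adjust to your preferred scheme
multi = {u"ж": u"zh", u"ц": u"ts", u"ч": u"ch", u"ш": u"sh",
         u"щ": u"shch", u"ю": u"yu", u"я": u"ya"}
for cyr, lat in multi.items():
    tr_extended[ord(cyr)] = lat
    # Uppercase letter gets a capitalized replacement, e.g. Ж -> Zh
    tr_extended[ord(cyr.upper())] = lat.capitalize()

def cyrillic2latin_extended(text):
    """Transliterate using the extended table; unknown characters pass through."""
    return text.translate(tr_extended)

print(cyrillic2latin_extended(u"Женя"))
```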
