Replacing special characters in pandas dataframe

If someone get the following error message

multiple repeat at position 2

try this df.replace(dictionary, regex=False, inplace=True)

instead of df.replace(dictionary, regex=True, inplace=True)

replace works out of the box without specifying a specific column in Python 3.

Load Data:

df=pd.read_csv('test.csv', sep=',', low_memory=False, encoding='iso8859_15')
df

Result:

col1    col2
0   he  hello
1   Nícolas shárk
2   welcome yes

Create Dictionary:

dictionary = {'í':'i', 'á':'a'}

Replace:

df.replace(dictionary, regex=True, inplace=True)

Result:

 col1   col2
0   he  hello
1   Nicolas shark
2   welcome yes

The docs on pandas.DataFrame.replace says you have to provide a nested dictionary: the first level is the column name for which you have to provide a second dictionary with substitution pairs.

So, this should work:

>>> df=pd.DataFrame({'a': ['NÍCOLAS','asdč'], 'b': [3,4]})
>>> df
         a  b
0  NÍCOLAS  3
1     asdč  4

>>> df.replace({'a': {'č': 'c', 'Í': 'I'}}, regex=True)
         a  b
0  NICOLAS  3
1     asdc  4

Edit. Seems pandas also accepts non-nested translation dictionary. In that case, the problem is probably with character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file characters properly (as true Unicode code-points), then you should take care your translation/substitution dictionary is also defined with Unicode characters, like this:

dictionary = {u'í': 'i', u'á': 'a'}

If you have a definition like this (and using Python 2):

dictionary = {'í': 'i', 'á': 'a'}

then the actual keys in that dictionary are multibyte strings. Which bytes (characters) they are depends on the actual source file character encoding used, but presuming you use UTF-8, you'll get:

dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}

And that would explain why pandas fails to replace those chars. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.

On the other hand, in Python 3, all strings are Unicode strings, and you don't have to use the u prefix (in fact unicode type from Python 2 is renamed to str in Python 3, and the old str from Python 2 is now bytes in Python 3).

Replacing special characters in pandas dataframe

Tags:

Python

Pandas

Related

Recent Posts