Replacing special characters in pandas dataframe
If you get the following error message
multiple repeat at position 2
try this
df.replace(dictionary, regex=False, inplace=True)
instead of
df.replace(dictionary, regex=True, inplace=True)
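One caveat with regex=False is that replace then matches whole cell values rather than substrings. If you still need substring replacement, a possible alternative is to keep regex=True and escape the dictionary keys so they are treated literally. The sketch below assumes a hypothetical key 'C**' (not from the original question), which is the kind of pattern that triggers the "multiple repeat" error:

```python
import re
import pandas as pd

# Hypothetical data and key, used only to illustrate the error.
df = pd.DataFrame({'col1': ['C** rocks', 'plain']})
dictionary = {'C**': 'C-star-star'}

# Compiling the raw key as a regex fails, because '**' is not a valid pattern.
try:
    re.compile('C**')
    message = ''
except re.error as err:
    message = str(err)

# Escape each key so it is matched literally, then keep regex=True
# so that substrings (not only whole cells) are replaced.
escaped = {re.escape(key): value for key, value in dictionary.items()}
result = df.replace(escaped, regex=True)
print(result)
```

Escaping only the keys leaves the replacement values untouched, so the rest of the call stays the same.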
replace works out of the box without specifying a column in Python 3.
Load Data:
import pandas as pd
df = pd.read_csv('test.csv', sep=',', low_memory=False, encoding='iso8859_15')
df
Result:
      col1   col2
0       he  hello
1  Nícolas  shárk
2  welcome    yes
Create Dictionary:
dictionary = {'í':'i', 'á':'a'}
Replace:
df.replace(dictionary, regex=True, inplace=True)
Result:
      col1   col2
0       he  hello
1  Nicolas  shark
2  welcome    yes
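The steps above can be reproduced without test.csv. This is a minimal sketch that builds the same table inline (the DataFrame constructor call stands in for the CSV load) and returns a new frame instead of using inplace=True:

```python
import pandas as pd

# Same values as the loaded CSV, built inline so the example is self-contained.
df = pd.DataFrame({
    'col1': ['he', 'Nícolas', 'welcome'],
    'col2': ['hello', 'shárk', 'yes'],
})

dictionary = {'í': 'i', 'á': 'a'}

# regex=True makes each key match inside cell values, across all columns.
cleaned = df.replace(dictionary, regex=True)
print(cleaned)
```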
According to the docs on pandas.DataFrame.replace, you provide a nested dictionary: the first level maps each column name to a second dictionary of substitution pairs.
So, this should work:
>>> df = pd.DataFrame({'a': ['NÍCOLAS', 'asdč'], 'b': [3, 4]})
>>> df
         a  b
0  NÍCOLAS  3
1     asdč  4
>>> df.replace({'a': {'č': 'c', 'Í': 'I'}}, regex=True)
         a  b
0  NICOLAS  3
1     asdc  4
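To make the effect of the nested form visible, here is a small sketch with two string columns (column b and its values are made up for illustration). The nested dictionary only touches column a, while a flat dictionary is applied to every column:

```python
import pandas as pd

# Column 'b' is hypothetical; it contains the same special characters as 'a'.
df = pd.DataFrame({'a': ['NÍCOLAS', 'asdč'], 'b': ['Ívan', 'čaj']})

# Nested form: substitutions are limited to column 'a'.
only_a = df.replace({'a': {'č': 'c', 'Í': 'I'}}, regex=True)

# Flat form: substitutions are applied to all columns.
everywhere = df.replace({'č': 'c', 'Í': 'I'}, regex=True)

print(only_a)
print(everywhere)
```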
Edit: It seems pandas also accepts a non-nested translation dictionary. In that case, the problem is probably with character encoding, particularly if you use Python 2. Assuming your CSV load function decoded the file characters properly (as true Unicode code points), you should make sure your translation/substitution dictionary is also defined with Unicode characters, like this:
dictionary = {u'í': 'i', u'á': 'a'}
If you have a definition like this (and using Python 2):
dictionary = {'í': 'i', 'á': 'a'}
then the actual keys in that dictionary are byte strings. Which bytes they contain depends on the character encoding of the source file, but assuming you use UTF-8, you'll get:
dictionary = {'\xc3\xa1': 'a', '\xc3\xad': 'i'}
And that would explain why pandas fails to replace those chars. So, be sure to use Unicode literals in Python 2: u'this is unicode string'.
On the other hand, in Python 3, all strings are Unicode strings, and you don't have to use the u prefix (in fact, the unicode type from Python 2 is renamed to str in Python 3, and the old str from Python 2 is now bytes in Python 3).
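The difference between the two kinds of keys can be checked in Python 3 with the standard library alone. This sketch encodes the characters to UTF-8 to show the byte sequences a Python 2 byte-string key would have held, and why such a key never matches decoded text:

```python
# In Python 3, a plain literal is a Unicode string of one code point.
key_text = 'í'

# Its UTF-8 encoding is two bytes, matching the escaped keys shown above.
key_bytes = key_text.encode('utf-8')    # b'\xc3\xad'
other_bytes = 'á'.encode('utf-8')       # b'\xc3\xa1'

text = 'Nícolas'

# A Unicode key matches the decoded text, so the replacement works.
replaced = text.replace(key_text, 'i')

# Reading the two UTF-8 bytes back as two separate characters gives a
# different string, which does not occur in the decoded text.
misread_key = key_bytes.decode('latin-1')
found_misread = misread_key in text

print(len(key_text), len(key_bytes), replaced, found_misread)
```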