pandas - Merging on string columns not working (bug?)

Thanks,@seeiespi the ..str.encode('utf-8') has helped me to figure out that my string needs to be stripped, as below

20                 b'Belize '   ...     0,612
21                  b'Benin '   ...     0,546

The solution was to use the strip

df1.col1 = df1.col1.str.strip()
df1.col1 = df1.col1.str.strip()

I ran into a case where the df.col = df.col.astype(str) solution did not work. Turns out the problem was in the encoding.

My original data looked like this:

In [72]: df1['col1'][:3]
Out[73]: 
             col1
0  dustin pedroia
1  kevin youkilis
2     david ortiz

In [72]: df2['col2'][:3]
Out[73]: 
             col2
0  dustin pedroia
1  kevin youkilis
2     david ortiz

And after using .astype(str) the merge still wasn't working so I executed the following:

df1.col1 = df1.col1.str.encode('utf-8')
df2.col2 = df2.col2.str.encode('utf-8')

and was able to find the difference:

In [95]: df1
Out[95]: 
                       col1
0  b'dustin\xc2\xa0pedroia'
1  b'kevin\xc2\xa0youkilis'
2     b'david\xc2\xa0ortiz'

In [95]: df2
Out[95]: 
                col2
0  b'dustin pedroia'
1  b'kevin youkilis'
2     b'david ortiz'

At which point all I had to do was run df1.col1 = df1.col1.str.replace('\xa0',' ') on the decoded df1.col1 variable (i.e. before running .str.encode('utf-8')) and the merge worked perfectly.

NOTE: Regardless of what I was replacing I always used .str.encode('utf-8') to check whether it worked.

Alternatively

Using regular expressions and the Variable Explorer in the Spyder IDE for Anaconda I found the following difference.

import re
#places the raw string into a list
df1.col1 = df1.col1.apply(lambda x: re.findall(x, x))  
df2.col2 = df2.col2.apply(lambda x: re.findall(x, x))

where my df1 data turned into this (copied and pasted from Spyder):

['dustin\xa0pedroia']
['kevin\xa0youkilis']
['david\xa0ortiz']

which just has a slightly different solution. I don't know in what case the first example wouldn't work and the second would but I wanted to provide both just in case someone runs into it :)


The issue was that the object dtype is misleading. I thought it mean that all items were strings. But apparently, while reading the file pandas was converting some elements to ints, and leaving the remainders as strings.

The solution was to make sure that every field is a string:

>>> df1.col1 = df1.col1.astype(str)
>>> df2.col2 = df2.col2.astype(str)

Then the merge works as expected.

(I wish there was a way of specifying a dtype of str...)


None of the above solutions worked for me because the merge was actually done correctly but the indexing got messed up. Removing the index solved it for me:

df['sth'] = df.merge(df2, how='left', on=['x', 'y'])['sth'].values