Why isn't pandas logical operator aligning on the index like it should?
Viewing the whole traceback for a Series comparison with mismatched indexes, particularly focusing on the exception message:
In [1]: import pandas as pd
In [2]: x = pd.Series([1, 2, 3], index=list('abc'))
In [3]: y = pd.Series([2, 3, 3], index=list('bca'))
In [4]: x == y
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-73b2790c1e5e> in <module>()
----> 1 x == y
/usr/lib/python3.7/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
1188
1189 elif isinstance(other, ABCSeries) and not self._indexed_same(other):
-> 1190 raise ValueError("Can only compare identically-labeled "
1191 "Series objects")
1192
ValueError: Can only compare identically-labeled Series objects
we see that this is a deliberate implementation decision. Also, this is not unique to Series objects - DataFrames raise a similar error.
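As a quick check of the DataFrame claim, here is a minimal sketch. The frames `df1` and `df2` and the column name `v` are made up for illustration, and the exact wording of the message may differ between pandas versions:

```python
import pandas as pd

# Hypothetical frames: same labels, different row order.
df1 = pd.DataFrame({"v": [1, 2, 3]}, index=list("abc"))
df2 = pd.DataFrame({"v": [2, 3, 3]}, index=list("bca"))

try:
    df1 == df2
    raised = False
except ValueError as exc:
    # The == operator refuses to compare differently labeled frames.
    raised = True
    print(exc)
```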
Digging through the Git blame for those lines eventually turns up some relevant commits and issue tracker threads. For example, Series.__eq__ used to completely ignore the RHS's index, and in a comment on a bug report about that behavior, pandas author Wes McKinney says the following:
This is actually a feature / deliberate choice and not a bug-- it's related to #652. Back in January I changed the comparison methods to do auto-alignment, but found that it led to a large amount of bugs / breakage for users and, in particular, many NumPy functions (which regularly do things like arr[1:] == arr[:-1]; example: np.unique) stopped working. This gets back to the issue that Series isn't quite ndarray-like enough and should probably not be a subclass of ndarray.
So, I haven't got a good answer for you except for that; auto-alignment would be ideal but I don't think I can do it unless I make Series not a subclass of ndarray. I think this is probably a good idea but not likely to happen until 0.9 or 0.10 (several months down the road).
This was then changed to the current behavior in pandas 0.19.0. Quoting the "what's new" page:
Following Series operators have been changed to make all operators consistent, including DataFrame (GH1134, GH4581, GH13538)
- Series comparison operators now raise ValueError when index are different.
- Series logical operators align both index of left and right hand side.
This made the Series behavior match that of DataFrame, which already rejected mismatched indices in comparisons.
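To see the two rules from that changelog side by side, here is a small sketch. The Series `left` and `right` are invented for illustration, and it assumes a pandas version at or after 0.19.0:

```python
import pandas as pd

# Illustrative boolean Series with the same labels in a different order.
left = pd.Series([True, False, True], index=list("abc"))
right = pd.Series([True, True, False], index=list("bca"))

# Logical operators align on the index before combining:
# a: True & False, b: False & True, c: True & True
both = left & right

# Comparison operators raise on mismatched indexes instead of aligning.
try:
    left == right
    comparison_raised = False
except ValueError:
    comparison_raised = True
```

So `&` pairs values by label, while `==` on the same two objects raises.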
In summary, making the comparison operators align indices automatically turned out to break too much stuff, so this was the best alternative.
One thing I love about Python is that you can peek into the source code of almost anything. Looking at the source of pd.Series.eq, it calls:
def flex_wrapper(self, other, level=None, fill_value=None, axis=0):
    # other stuff
    # ...
    if isinstance(other, ABCSeries):
        return self._binop(other, op, level=level, fill_value=fill_value)
and goes on to pd.Series._binop:
def _binop(self, other, func, level=None, fill_value=None):
    # other stuff
    # ...
    if not self.index.equals(other.index):
        this, other = self.align(other, level=level, join='outer',
                                 copy=False)
        new_index = this.index
That means the eq method aligns the two Series before comparing them (which, apparently, the plain == operator does not).
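Using the x and y from the question, a short sketch of the two ways to get a label-aligned comparison: the eq method, or reindexing one side yourself before using ==. The variable names `aligned` and `reindexed` are just for illustration:

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=list("abc"))
y = pd.Series([2, 3, 3], index=list("bca"))

# eq aligns on the index first, so values are compared label by label:
# a: 1 vs 3 -> False, b: 2 vs 2 -> True, c: 3 vs 3 -> True
aligned = x.eq(y)

# Alternatively, put y in x's order so that == sees identical indexes.
reindexed = x == y.reindex(x.index)

print(aligned)
print(reindexed)
```

The reindex route only makes sense when both Series carry the same set of labels; eq also copes with labels missing on one side, since it does an outer alignment.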
Back in 2012, before pandas had eq, ne and gt, there was a problem: Series whose indexes were in a different order returned unexpected output from the comparison operators (>, <, ==, !=). The fix was to add new functions (gt, ge, ne, ...).
GitHub Ticket reference