Understanding pandas DataFrame indexing

I am pretty sure that your first approach returns a copy rather than a view, so assigning to it does not change the original data. I am not sure why this happens, though.

It seems to depend on the order in which you select rows and columns, not on the syntax used to get the column. Both of these work:

df.D[df.key == 1] = 1
df['D'][df.key == 1] = 1

And neither of these works:

df[df.key == 1]['D'] = 1
df[df.key == 1].D = 1
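To make the failing case concrete, here is a minimal sketch. The small frame and its values are made up for illustration; only the names df, key and D come from the question. The warnings filter is there because newer pandas versions warn about this pattern.

```python
import warnings

import pandas as pd

# Hypothetical data, only the column names come from the question.
df = pd.DataFrame({"key": [1, 2, 1], "D": [0, 0, 0]})

with warnings.catch_warnings():
    # Newer pandas versions emit a chained-assignment warning here.
    warnings.simplefilter("ignore")
    # Rows are selected first, so the assignment lands on a temporary object.
    df[df.key == 1]["D"] = 1

unchanged = df["D"].tolist()
print(unchanged)  # [0, 0, 0]
```

The original frame still holds its old values, which is what "does not work" means here.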

From this evidence, I would assume that the slice df[df.key == 1] is returning a copy. But that doesn't fit either: df[df.key == 1] = 0 does change the original data, as if the slice were a view.

So, I'm not sure. My sense is that this behavior has changed across pandas versions. I seem to remember that df.D used to return a copy and df['D'] used to return a view, but that no longer appears to be true (as of pandas 0.10.0).
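A quick way to check the direct assignment is a sketch like the following. The data is again made up, and the result is what I would expect from a single assignment on df itself.

```python
import pandas as pd

# Hypothetical data, only the column names come from the question.
df = pd.DataFrame({"key": [1, 2, 1], "D": [5, 6, 7]})

# One indexing step on the left-hand side: the mask is applied to df directly,
# so every column of the matching rows is overwritten.
df[df.key == 1] = 0

print(df["key"].tolist())  # [0, 2, 0]
print(df["D"].tolist())    # [0, 6, 0]
```

Note that the mask is evaluated before the assignment, so overwriting the key column in the same statement is not a problem.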

If you want a more complete answer, you should post in the pystatsmodels forum: https://groups.google.com/forum/?fromgroups#!forum/pystatsmodels


The pandas documentation says:

Returning a view versus a copy

The rules about when a view on the data is returned are entirely dependent on NumPy. Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.
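Since the quoted rules defer to NumPy, the distinction can be seen on a plain array. This is only a sketch of the NumPy side, with a made-up array; it says nothing about how a particular pandas version wraps the result.

```python
import numpy as np

a = np.arange(6)       # array([0, 1, 2, 3, 4, 5])

sliced = a[1:4]        # basic slicing
masked = a[a > 2]      # boolean indexing

shares_slice = np.shares_memory(a, sliced)
shares_mask = np.shares_memory(a, masked)
print(shares_slice)    # True: the slice is a view
print(shares_mask)     # False: the boolean selection is a copy

sliced[0] = 100        # writes through to a
masked[0] = -1         # only changes the copy
print(a.tolist())      # [0, 100, 2, 3, 4, 5]
```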

In df[df.key==1]['D'] you first do boolean slicing (which produces a copy of the DataFrame), then you select the column ['D'] from that copy.

In df.D[df.key==1] = 3.4, you first choose a column, then do boolean slicing on the resulting Series.

This seems to make the difference, although I must admit that it is a little counterintuitive.

Edit: Dougal identified the difference in his comment. In version 1, the copy is made when the __getitem__ method is called for the boolean slicing. In version 2, the boolean mask only reaches the __setitem__ method, so nothing is copied and the value is simply assigned.
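The call sequence Dougal describes can be traced with a toy class. Traced is a hypothetical stand-in, not pandas; it only records which method receives which key, with the strings "mask" and "D" standing for the boolean mask and the column label.

```python
class Traced:
    """Hypothetical container that logs which indexing hook is called."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def __getitem__(self, key):
        self.log.append((self.name, "__getitem__", key))
        # Return a new object, standing in for whatever the lookup produces.
        return Traced(f"{self.name}[{key!r}]", self.log)

    def __setitem__(self, key, value):
        self.log.append((self.name, "__setitem__", key))


log = []
obj = Traced("df", log)

obj["mask"]["D"] = 1       # version 1: mask first, then column
version_1 = list(log)
log.clear()

obj["D"]["mask"] = 1       # version 2: column first, then mask
version_2 = list(log)

print(version_1)
print(version_2)
```

In version 1 the mask goes through __getitem__, and the assignment is made on the object that call returned. In version 2 the mask is only ever passed to __setitem__.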