Pandas DataFrame mutability
Great question, thanks. I ended up playing around a bit after reading the other answers, so I want to share the result with you.
Here is some code to experiment with:
import pandas as pd
import numpy as np
df=pd.DataFrame([[1,2,3],[4,5,6]])
print('start',df,sep='\n',end='\n\n')
def testAddCol(df):
    df=pd.DataFrame(df, copy=True) #experiment in this line: df=df.copy(), df=df.iloc[:2,:2], df.iloc[:2,:2].copy(), nothing, ...
    df['newCol']=11
    df.iloc[0,0]=100
    return df
df2=testAddCol(df)
print('df',df,sep='\n',end='\n\n')
print('df2',df2,sep='\n',end='\n\n')
Output:
start
   0  1  2
0  1  2  3
1  4  5  6

df
   0  1  2
0  1  2  3
1  4  5  6

df2
     0  1  2  newCol
0  100  2  3      11
1    4  5  6      11
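To try the variants from the comment above in one go, something like the following might help. It is only a rough sketch: the helper name original_changed and the variants dict are my own inventions, not pandas API. For the variants that do not copy explicitly (pd.DataFrame(df) and the iloc slice), the result can differ between pandas versions and copy-on-write settings, and older versions may emit a SettingWithCopyWarning, so I deliberately show no expected output.

```python
import pandas as pd

def original_changed(make_local):
    """Return True if writing to make_local(df) also changes df.

    Hypothetical helper for experimenting, not part of pandas.
    """
    df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
    before = df.copy()           # independent snapshot for comparison
    local = make_local(df)       # the line we are experimenting with
    local.iloc[0, 0] = 100       # write through the local name
    return not df.equals(before)

# The variants listed in the comment of testAddCol.
variants = {
    "df.copy()": lambda df: df.copy(),
    "pd.DataFrame(df, copy=True)": lambda df: pd.DataFrame(df, copy=True),
    "nothing (same object)": lambda df: df,
    "pd.DataFrame(df)": lambda df: pd.DataFrame(df),
    "df.iloc[:2, :2]": lambda df: df.iloc[:2, :2],
}

results = {name: original_changed(make) for name, make in variants.items()}
for name, changed in results.items():
    print(f"{name:30} original changed: {changed}")
```

The two explicit copies should never touch the original, and passing the same object through always does. The remaining two rows are the ones worth checking on your own installation.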
This:
df2 = pd.DataFrame(df1)
Constructs a new DataFrame. There is a copy parameter whose default argument is False (newer pandas releases document the default as None, which behaves like False for DataFrame input). According to the documentation, it means:
> Copy data from inputs. Only affects DataFrame / 2d ndarray input
So data will be shared between df2 and df1 by default. If you want there to be no sharing, but rather a complete copy, do this:
df2 = pd.DataFrame(df1, copy=True)
Or more concisely and idiomatically:
df2 = df1.copy()
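A quick way to convince yourself that both spellings give an independent copy is to write to the copies and look at the original afterwards. This is a minimal sketch; the names via_constructor and via_method are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 6]])

# Two ways to request an independent copy.
via_constructor = pd.DataFrame(df1, copy=True)
via_method = df1.copy()

# Write to both copies.
via_constructor.iloc[0, 0] = 100
via_method.iloc[0, 0] = 200

# The original keeps its value in either case.
original_value = int(df1.iloc[0, 0])
print(original_value)  # 1
```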
If you do this:
df2 = df1.iloc[2:3,1:2].copy()
You will again get an independent copy. But if you do this:
df2 = pd.DataFrame(df1.iloc[2:3,1:2])
It will probably share the data, but this style is pretty unclear if you intend to modify df, so I suggest not writing such code. Instead, if you want no copy, just say this:
df2 = df1.iloc[2:3,1:2]
In summary: if you want a reference to existing data, do not call pd.DataFrame() or any other method at all. If you want an independent copy, call .copy().
> It will probably share the data, but this style is pretty unclear if you intend to modify df, so I suggest not writing such code. Instead, if you want no copy, just say this:
> df2 = df1.iloc[2:3,1:2]
> In summary: if you want a reference to existing data, do not call pd.DataFrame() or any other method at all. If you want an independent copy, call .copy()
I do not agree. Doing the above still returns a reference to the sliced section of the original DataFrame, so any changes you make to df2 will be reflected in df1.
Instead, .copy() should be used:
df2 = df1.iloc[2:3,1:2].copy()
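As a small check of the .copy() variant, here is a sketch with a made-up four-row frame (the column names a and b are just for illustration; the slice 2:3 needs at least three rows). Whether the version without .copy() writes through to df1 depends on the pandas version and its copy-on-write setting, so only the copied case is shown.

```python
import pandas as pd

# Illustrative data, large enough for the slice 2:3, 1:2.
df1 = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# Slice, then copy: df2 owns its data.
df2 = df1.iloc[2:3, 1:2].copy()
df2.iloc[0, 0] = -1

# The corresponding cell in df1 is untouched.
print(df1.iloc[2, 1])  # 30
```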