pandas DataFrame concat / update ("upsert")?
One solution is to conatenate df1
with new rows in df2
(i.e. where the index does not match). Then update the values with those from df2
.
df = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
df.update(df2)
>>> df
A B
2015-10-01 A1 B1
2015-10-02 a1 b1
2015-10-03 a2 b2
2015-10-04 a3 b3
EDIT: Per the suggestion of @chrisb, this can further be simplified as follows:
pd.concat([df1[~df1.index.isin(df2.index)], df2])
Thanks Chris!
In addition to the correct answer, watch out for columns that do not exist in both dataframes:
df1 = pd.DataFrame([['test',1, True], ['test2',2, True]]).set_index(0)
df2 = pd.DataFrame([['test2',4], ['test3',3]]).set_index(0)
If you just use the aforementioned solution as-is, you get:
>>> 1 2
0
test 1 True
test2 4 NaN
test3 3 NaN
But if you are expecting the following output:
>>> 1 2
0
test 1 True
test2 4 True
test3 3 NaN
Just change the statement to:
df1 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
df1.update(df2)
As of pandas 1.0.3
, the desired functionality is directly given by combine_first
:
combined = df2.combine_first(df1)
print(combined)
# A B
# 2015-10-01 A1 B1
# 2015-10-02 a1 b1
# 2015-10-03 a2 b2
# 2015-10-04 a3 b3
To get this behaviour, the dataframe whose data has priority (the updating one, in this case df2
) must be the one calling the function.
It basically: (1) harmonizes rows and columns, (2) gives priority to non-NaN data, and (3) if datapoints defined in both dataframes, gives priority to data in df2
, which is essentially what you want.
EDIT: My understanding is that combine_first
does fulfill the requested "update-if-present-insert-if-absent" behaviour. However, according to Vijchti in the comments (thanks), this does not strictly correspond to an SQL UPSERT manipulation, as the logic is applied value-by-value instead of the entire row. I removed any reference to UPSERT from my answer.