Pandas Merge - How to avoid duplicating columns
You can work out the columns that are only in one DataFrame and use this to select a subset of columns in the merge.
cols_to_use = df2.columns.difference(df.columns)
Then perform the merge (note this is an index object but it has a handy tolist()
method).
dfNew = merge(df, df2[cols_to_use], left_index=True, right_index=True, how='outer')
This will avoid any columns clashing in the merge.
I use the suffixes
option in .merge()
:
dfNew = df.merge(df2, left_index=True, right_index=True,
how='outer', suffixes=('', '_y'))
dfNew.drop(dfNew.filter(regex='_y$').columns, axis=1, inplace=True)
Thanks @ijoseph
Building on @rprog's answer, you can combine the various pieces of the suffix & filter step into one line using a negative regex:
dfNew = df.merge(df2, left_index=True, right_index=True,
how='outer', suffixes=('', '_DROP')).filter(regex='^(?!.*_DROP)')
Or using df.join
:
dfNew = df.join(df2, lsuffix="DROP").filter(regex="^(?!.*DROP)")
The regex here is keeping anything that does not end with the word "DROP", so just make sure to use a suffix that doesn't appear among the columns already.
I'm freshly new with Pandas but I wanted to achieve the same thing, automatically avoiding column names with _x or _y and removing duplicate data. I finally did it by using this answer and this one from Stackoverflow
sales.csv
city;state;units Mendocino;CA;1 Denver;CO;4 Austin;TX;2
revenue.csv
branch_id;city;revenue;state_id 10;Austin;100;TX 20;Austin;83;TX 30;Austin;4;TX 47;Austin;200;TX 20;Denver;83;CO 30;Springfield;4;I
merge.py import pandas
def drop_y(df):
# list comprehension of the cols that end with '_y'
to_drop = [x for x in df if x.endswith('_y')]
df.drop(to_drop, axis=1, inplace=True)
sales = pandas.read_csv('data/sales.csv', delimiter=';')
revenue = pandas.read_csv('data/revenue.csv', delimiter=';')
result = pandas.merge(sales, revenue, how='inner', left_on=['state'], right_on=['state_id'], suffixes=('', '_y'))
drop_y(result)
result.to_csv('results/output.csv', index=True, index_label='id', sep=';')
When executing the merge command I replace the _x
suffix with an empty string and them I can remove columns ending with _y
output.csv
id;city;state;units;branch_id;revenue;state_id 0;Denver;CO;4;20;83;CO 1;Austin;TX;2;10;100;TX 2;Austin;TX;2;20;83;TX 3;Austin;TX;2;30;4;TX 4;Austin;TX;2;47;200;TX