In pandas, how to concatenate horizontally and then remove the redundant columns

To avoid duplication of the columns while joining two data frames use the ignore_index argument.

pd.concat([df1, df2], ignore_index=True, sort=False)

But use it only if wish to append them and ignore the fact that they may have overlapping indexes


Dropping duplicates should work. Because drop_duplicates only works with index, we need to transpose the DF to drop duplicates and transpose it back.

pd.concat([DF1, DF2], axis = 1).T.drop_duplicates().T

Use difference for columns from DF2 which are not in DF1 and simple select them by []:

DF1 = pd.DataFrame(columns=['col1', 'col2', 'col3'])
DF2 = pd.DataFrame(columns=['col2', 'col4', 'col5'])


DF2 = DF2[DF2.columns.difference(DF1.columns)]
print (DF2)
Empty DataFrame
Columns: [col4, col5]
Index: []

print (pd.concat([DF1, DF2], axis = 1))
Empty DataFrame
Columns: [col1, col2, col3, col4, col5]
Index: []

Timings:

np.random.seed(123)

N = 1000
DF1 = pd.DataFrame(np.random.rand(N,3), columns=['col1', 'col2', 'col3'])
DF2 = pd.DataFrame(np.random.rand(N,3), columns=['col2', 'col4', 'col5'])

DF2['col2'] = DF1['col2']

In [408]: %timeit (pd.concat([DF1, DF2], axis = 1).T.drop_duplicates().T)
10 loops, best of 3: 122 ms per loop

In [409]: %timeit (pd.concat([DF1, DF2[DF2.columns.difference(DF1.columns)]], axis = 1))
1000 loops, best of 3: 979 µs per loop

N = 10000:
In [411]: %timeit (pd.concat([DF1, DF2], axis = 1).T.drop_duplicates().T)
1 loop, best of 3: 1.4 s per loop

In [412]: %timeit (pd.concat([DF1, DF2[DF2.columns.difference(DF1.columns)]], axis = 1))
1000 loops, best of 3: 1.12 ms per loop

Tags:

Python

Pandas