Python: pandas merge multiple dataframes
functools.reduce and pd.concat are good solutions but in term of execution time pd.concat is the best.
from functools import reduce
import pandas as pd
dfs = [df1, df2, df3, ...]
nan_value = 0
# solution 1 (fast)
result_1 = pd.concat(dfs, join='outer', axis=1).fillna(nan_value)
# solution 2
result_2 = reduce(lambda df_left,df_right: pd.merge(df_left, df_right,
left_index=True, right_index=True,
how='outer'),
dfs).fillna(nan_value)
Looks like the data has the same columns, so you can:
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
merged_df = pd.concat([df1, df2])
Below, is the most clean, comprehensible way of merging multiple dataframe if complex queries aren't involved.
Just simply merge with DATE as the index and merge using OUTER method (to get all the data).
import pandas as pd
from functools import reduce
df1 = pd.read_table('file1.csv', sep=',')
df2 = pd.read_table('file2.csv', sep=',')
df3 = pd.read_table('file3.csv', sep=',')
Now, basically load all the files you have as data frame into a list. And, then merge the files using merge
or reduce
function.
# compile the list of dataframes you want to merge
data_frames = [df1, df2, df3]
Note: you can add as many data-frames inside the above list. This is the good part about this method. No complex queries involved.
To keep the values that belong to the same date you need to merge it on the DATE
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['DATE'],
how='outer'), data_frames)
# if you want to fill the values that don't exist in the lines of merged dataframe simply fill with required strings as
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['DATE'],
how='outer'), data_frames).fillna('void')
- Now, the output will the values from the same date on the same lines.
- You can fill the non existing data from different frames for different columns using fillna().
Then write the merged data to the csv file if desired.
pd.DataFrame.to_csv(df_merged, 'merged.txt', sep=',', na_rep='.', index=False)
This should give you
DATE VALUE1 VALUE2 VALUE3 ....
There are 2 solutions for this, but it return all columns separately:
import functools
dfs = [df1, df2, df3]
df_final = functools.reduce(lambda left,right: pd.merge(left,right,on='date'), dfs)
print (df_final)
date a_x b_x a_y b_y c_x a b c_y
0 May 15,2017 900.00 0.2% 1,900.00 1000000 0.2% 2,900.00 2000000 0.2%
k = np.arange(len(dfs)).astype(str)
df = pd.concat([x.set_index('date') for x in dfs], axis=1, join='inner', keys=k)
df.columns = df.columns.map('_'.join)
print (df)
0_a 0_b 1_a 1_b 1_c 2_a 2_b 2_c
date
May 15,2017 900.00 0.2% 1,900.00 1000000 0.2% 2,900.00 2000000 0.2%