How to concatenate multiple pandas.DataFrames without running into MemoryError
I advise you to write your dataframes into a single CSV file, appending them one after another, then read that file back. Execute this:
# write df1 content in file.csv
df1.to_csv('file.csv', index=False)
# append df2 content to file.csv
df2.to_csv('file.csv', mode='a', header=False, index=False)
# append df3 content to file.csv
df3.to_csv('file.csv', mode='a', header=False, index=False)
# free memory
del df1, df2, df3
# read all df1, df2, df3 contents
df = pd.read_csv('file.csv')
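If you have more than a handful of dataframes, the same idea generalizes to a loop. A minimal sketch, assuming dfs is a list holding your dataframes:
# write the first frame with its header, append the others without one
for i, df in enumerate(dfs):
    df.to_csv('file.csv', mode='w' if i == 0 else 'a', header=(i == 0), index=False)
del dfs
df = pd.read_csv('file.csv')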
If that is not fast enough, or you need to concatenate larger files than usual, do:
df1.to_csv('file.csv', index=False)
df2.to_csv('file1.csv', index=False, header=False)
df3.to_csv('file2.csv', index=False, header=False)
del df1, df2, df3
Then run these bash commands:
cat file1.csv >> file.csv
cat file2.csv >> file.csv
Or concatenate the CSV files in Python:
def concat(file1, file2):
    # append the content of file2 to the end of file1
    with open(file2, 'r') as filename2:
        data = filename2.read()
    with open(file1, 'a') as filename1:
        filename1.write(data)

concat('file.csv', 'file1.csv')
concat('file.csv', 'file2.csv')
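Note that this helper reads each file entirely into memory before appending it. If the intermediate files are themselves large, you can stream the copy instead; a sketch using the standard library's shutil.copyfileobj (concat_streamed is just an illustrative name):
import shutil

def concat_streamed(target, source):
    # copy source onto the end of target in fixed-size chunks
    with open(source, 'rb') as src, open(target, 'ab') as dst:
        shutil.copyfileobj(src, dst)

concat_streamed('file.csv', 'file1.csv')
concat_streamed('file.csv', 'file2.csv')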
Then read the result:
df = pd.read_csv('file.csv')
As noted in the other answers, the problem is one of memory. A solution is to store the data on disk, then build a single dataframe.
With such huge data, performance is an issue.
CSV solutions are very slow, since conversion to text mode occurs. HDF5 solutions are shorter, more elegant and faster, since they work in binary mode. I propose a third way in binary mode, with pickle, which seems to be even faster, but is more technical and needs some more room. And a fourth, by hand.
Here is the code:
import numpy as np
import pandas as pd
import os
import pickle
# a DataFrame factory:
dfs = []
for i in range(10):
    dfs.append(pd.DataFrame(np.empty((10**5, 4)), columns=range(4)))
# a csv solution
def bycsv(dfs):
    md, hd = 'w', True
    for df in dfs:
        df.to_csv('df_all.csv', mode=md, header=hd, index=None)
        md, hd = 'a', False
    #del dfs
    df_all = pd.read_csv('df_all.csv', index_col=None)
    os.remove('df_all.csv')
    return df_all
Better solutions:
def byHDF(dfs):
    # needs the PyTables package; every frame is appended to the same HDF5 key
    store = pd.HDFStore('df_all.h5')
    for df in dfs:
        store.append('df', df, data_columns=list('0123'))
    #del dfs
    df = store.select('df')
    store.close()
    os.remove('df_all.h5')
    return df
def bypickle(dfs):
    c = []
    with open('df_all.pkl', 'ab') as f:
        # dump each frame into the same pickle file and remember its length
        for df in dfs:
            pickle.dump(df, f)
            c.append(len(df))
    #del dfs
    with open('df_all.pkl', 'rb') as f:
        df_all = pickle.load(f)
        offset = len(df_all)
        # pre-allocate room for the remaining frames, then fill it in place
        df_all = pd.concat([df_all, pd.DataFrame(np.empty(sum(c[1:]) * 4).reshape(-1, 4))],
                           ignore_index=True)
        for size in c[1:]:
            df = pickle.load(f)
            df_all.iloc[offset:offset + size] = df.values
            offset += size
    os.remove('df_all.pkl')
    return df_all
For homogeneous dataframes, we can do even better:
def byhand(dfs):
    mtot = 0
    with open('df_all.bin', 'wb') as f:
        # dump the raw binary buffer of each frame and record the total row count
        for df in dfs:
            m, n = df.shape
            mtot += m
            f.write(df.values.tobytes())
            typ = df.values.dtype
    #del dfs
    with open('df_all.bin', 'rb') as f:
        # rebuild a single array from the bytes, then wrap it in a DataFrame
        buffer = f.read()
        data = np.frombuffer(buffer, dtype=typ).reshape(mtot, n)
        df_all = pd.DataFrame(data=data, columns=list(range(n)))
    os.remove('df_all.bin')
    return df_all
And some tests on small (32 MB) data to compare performance; multiply by about 128 for 4 GB.
In [92]: %time w=bycsv(dfs)
Wall time: 8.06 s
In [93]: %time x=byHDF(dfs)
Wall time: 547 ms
In [94]: %time v=bypickle(dfs)
Wall time: 219 ms
In [95]: %time y=byhand(dfs)
Wall time: 109 ms
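The %time lines above are IPython magics. Outside IPython, a rough equivalent using the standard library's time module (a sketch, reusing the dfs factory above):
import time

for name, fn in [('bycsv', bycsv), ('byHDF', byHDF), ('bypickle', bypickle), ('byhand', byhand)]:
    t0 = time.perf_counter()
    fn(dfs)
    print(name, round(time.perf_counter() - t0, 3), 's')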
A check:
In [195]: (x.values==w.values).all()
Out[195]: True
In [196]: (x.values==v.values).all()
Out[196]: True
In [197]: (x.values==y.values).all()
Out[197]: True
Of course, all of that must be improved and tuned to fit your problem. For example, df3 can be split into chunks of size 'total_memory_size - df_total_size' to be able to run bypickle, as sketched below.
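A minimal sketch of that splitting idea, assuming df3 is the oversized frame and n_rows is a chunk size you pick to fit your memory budget:
# n_rows is an assumed chunk size, chosen so that one chunk fits in memory
n_rows = 10**5
chunks = [df3.iloc[i:i + n_rows] for i in range(0, len(df3), n_rows)]
df_all = bypickle([df1, df2] + chunks)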
I can edit this answer if you give more information on your data structure and size. Beautiful question!