Fastest way to write large CSV with Python
With everything unnecessary stripped out, this version should be both faster and easier to understand:
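    import random
    import uuid

    outfile = 'data.csv'
    outsize = 1024 * 1024 * 1024  # 1 GB

    # Text mode: in Python 3, writing a str to a file opened with 'ab' raises TypeError
    with open(outfile, 'a') as csvfile:
        size = 0
        while size < outsize:
            txt = '%s,%.6f,%.6f,%i\n' % (uuid.uuid4(), random.random()*50, random.random()*50, random.randrange(1000))
            size += len(txt)  # the line is pure ASCII, so this approximates the bytes written
            csvfile.write(txt)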
The problem appears to be mainly IO-bound. You can improve the I/O a bit by writing to the file in larger chunks instead of writing one line at a time:
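    import numpy as np
    import uuid
    import os

    outfile = 'data-alt.csv'
    outsize = 10  # MB
    chunksize = 1000

    # Text mode, since the rows are built as str (Python 3 rejects str on a file opened with 'ab')
    with open(outfile, 'a') as csvfile:
        # getsize only sees data already flushed to disk, so the final size can overshoot slightly
        while (os.path.getsize(outfile)//1024**2) < outsize:
            # Build one chunk column by column, then transpose it into rows with zip
            data = [[uuid.uuid4() for i in range(chunksize)],
                    np.random.random(chunksize)*50,
                    np.random.random(chunksize)*50,
                    np.random.randint(1000, size=(chunksize,))]
            csvfile.writelines(['%s,%.6f,%.6f,%i\n' % row for row in zip(*data)])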
You can experiment with the chunksize (the number of rows written per chunk) to see what works best on your machine.
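One way to run that experiment is a small timing harness like the sketch below. It is not the code that was benchmarked here: it uses only the standard library (no NumPy), tracks the size in memory instead of calling os.path.getsize, and the names write_csv and time_chunksizes are made up for illustration.

```python
import os
import random
import tempfile
import time
import uuid


def write_csv(path, target_bytes, chunksize):
    """Write random rows, `chunksize` lines at a time, until at least
    `target_bytes` have been written. Returns the number of bytes written."""
    written = 0
    # newline='' disables newline translation, so '\n' is written as one byte
    with open(path, 'w', newline='') as csvfile:
        while written < target_bytes:
            lines = ['%s,%.6f,%.6f,%i\n' % (uuid.uuid4(),
                                            random.random() * 50,
                                            random.random() * 50,
                                            random.randrange(1000))
                     for _ in range(chunksize)]
            csvfile.writelines(lines)
            # Every character is ASCII, so character count equals byte count
            written += sum(len(line) for line in lines)
    return written


def time_chunksizes(target_bytes, chunksizes):
    """Return {chunksize: seconds} for writing `target_bytes` at each chunk size."""
    timings = {}
    for chunksize in chunksizes:
        fd, path = tempfile.mkstemp(suffix='.csv')
        os.close(fd)
        try:
            start = time.perf_counter()
            write_csv(path, target_bytes, chunksize)
            timings[chunksize] = time.perf_counter() - start
        finally:
            os.remove(path)
    return timings


if __name__ == '__main__':
    # 1 MB per run keeps the experiment quick; raise it for more stable numbers
    for chunksize, seconds in time_chunksizes(1024 ** 2, [1, 100, 1000, 10000]).items():
        print('chunksize=%6d: %.3fs' % (chunksize, seconds))
```

Timings from a run this short are noisy, so it is worth repeating each measurement a few times before picking a chunk size.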
Here is a benchmark comparing the code above with your original code, with outsize set to 10 MB:
    % time original.py
    real 0m5.379s
    user 0m4.839s
    sys 0m0.538s

    % time write_in_chunks.py
    real 0m4.205s
    user 0m3.850s
    sys 0m0.351s
So this is about 25% faster than the original code (4.2 s versus 5.4 s of wall-clock time).
PS. I tried replacing the calls to os.path.getsize with an estimate of the total number of lines needed. Unfortunately, it did not improve the speed. Because the number of bytes needed to represent the final int varies, the estimate is also inexact: it does not perfectly replicate the behavior of your original code. So I left the os.path.getsize call in place.
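For reference, a line-count estimate of that kind could look roughly like the sketch below. This is an assumption about the general approach, not the code that was actually tried; the helper names are hypothetical, and the expected line length is derived from the '%s,%.6f,%.6f,%i\n' format used above.

```python
import math
import random
import uuid

UUID_LEN = 36        # str(uuid.uuid4()) is always 36 characters
SEPARATORS = 3 + 1   # three commas plus the trailing newline


def expected_line_length():
    """Average length of one '%s,%.6f,%.6f,%i\\n' line for the value ranges used here."""
    # '%.6f' of a value in [0, 50): 8 characters below 10 (20% of values), else 9
    float_len = 0.2 * 8 + 0.8 * 9
    # randrange(1000): 10 one-digit, 90 two-digit and 900 three-digit values
    int_len = (10 * 1 + 90 * 2 + 900 * 3) / 1000
    return UUID_LEN + SEPARATORS + 2 * float_len + int_len


def estimate_line_count(target_bytes):
    """Number of lines that should add up to roughly `target_bytes`."""
    return math.ceil(target_bytes / expected_line_length())


def write_estimated(path, target_bytes):
    """Write the estimated number of lines without ever checking the file size."""
    n_lines = estimate_line_count(target_bytes)
    # newline='' disables newline translation, so '\n' is written as one byte
    with open(path, 'w', newline='') as csvfile:
        csvfile.writelines(
            '%s,%.6f,%.6f,%i\n' % (uuid.uuid4(),
                                   random.random() * 50,
                                   random.random() * 50,
                                   random.randrange(1000))
            for _ in range(n_lines))
    return n_lines
```

Because the float and int fields vary in width, the resulting file only lands near the target size rather than stopping at the first line that crosses it.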