How to iterate over consecutive chunks of Pandas dataframe efficiently
Use numpy's array_split():
import numpy as np
import pandas as pd
data = pd.DataFrame(np.random.rand(10, 3))
for chunk in np.array_split(data, 5):
    assert len(chunk) == len(data) / 5, "This assert may fail for the last chunk if the data length isn't divisible by 5"
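If the row count doesn't divide evenly, array_split still works and simply makes the trailing chunks one row shorter, where np.split would raise an error instead. A quick illustration of that behaviour (a minimal sketch):
import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.rand(10, 3))

# 10 rows into 3 chunks: array_split yields sizes 4, 3, 3,
# while np.split would raise because 10 isn't divisible by 3.
sizes = [len(chunk) for chunk in np.array_split(data, 3)]
print(sizes)  # [4, 3, 3]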
I'm not sure if this is exactly what you want, but I found these grouper functions on another SO thread fairly useful when working with a multiprocessing pool.
Here's a short example from that thread, which might do something like what you want:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])

def chunker(seq, size):
    # Yield successive slices of seq, each `size` rows long;
    # the last slice may be shorter.
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

for i in chunker(df, 5):
    print(i)
Which gives you something like this:
          a         b         c         d
0  0.860574  0.059326  0.339192  0.786399
1  0.029196  0.395613  0.524240  0.380265
2  0.235759  0.164282  0.350042  0.877004
3  0.545394  0.881960  0.994079  0.721279
4  0.584504  0.648308  0.655147  0.511390
          a         b         c         d
5  0.276160  0.982803  0.451825  0.845363
6  0.728453  0.246870  0.515770  0.343479
7  0.971947  0.278430  0.006910  0.888512
8  0.044888  0.875791  0.842361  0.890675
9  0.200563  0.246080  0.333202  0.574488
           a         b         c         d
10  0.971125  0.106790  0.274001  0.960579
11  0.722224  0.575325  0.465267  0.258976
12  0.574039  0.258625  0.469209  0.886768
13  0.915423  0.713076  0.073338  0.622967
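As a side note, if you'd rather stay inside pandas, the same positional chunking can be expressed with groupby over a computed group index (a sketch that should be equivalent to chunker above):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])

# Rows 0-4 fall into group 0, rows 5-9 into group 1, rows 10-13 into group 2.
for _, chunk in df.groupby(np.arange(len(df)) // 5):
    print(chunk)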
I hope that helps.
EDIT
In this case, I used this function with a pool of processes in (approximately) this manner:
from multiprocessing import Pool

nprocs = 4
pool = Pool(nprocs)

# myfunction and domorestuff are placeholders for your own code.
# Note that iterating a DataFrame chunk directly yields its column
# labels, so in practice you'll want to map over rows explicitly.
for chunk in chunker(df, nprocs):
    data = pool.map(myfunction, chunk)
    data.domorestuff()
I assume this should be very similar to using the IPython distributed machinery, but I haven't tried it.
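For completeness, here is a self-contained, runnable version of the same pattern (a sketch; square and the row-wise itertuples mapping are my assumptions standing in for the real workload):
from multiprocessing import Pool

import numpy as np
import pandas as pd

def square(row):
    # Stand-in for the real per-row work; receives one row as a tuple.
    return sum(x ** 2 for x in row)

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

if __name__ == '__main__':
    df = pd.DataFrame(np.random.rand(14, 4), columns=['a', 'b', 'c', 'd'])
    with Pool(4) as pool:
        for chunk in chunker(df, 5):
            # itertuples(index=False, name=None) yields plain, picklable
            # tuples, one per row, which pool.map can distribute to workers.
            results = pool.map(square, list(chunk.itertuples(index=False, name=None)))
            print(results)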