Strategy for partitioning dask dataframes efficiently
After discussion with mrocklin, a decent partitioning strategy is to aim for 100MB partition sizes, guided by `df.memory_usage().sum().compute()`. With datasets that fit in RAM, the additional work this might involve can be mitigated by calling `df.persist()` at relevant points.
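For example, a minimal sketch of that strategy, assuming `df` is an existing Dask DataFrame (the `read_csv` call and file pattern are purely illustrative):

    import dask.dataframe as dd

    df = dd.read_csv("data-*.csv")                    # hypothetical input

    target = 100e6                                    # ~100MB per partition, in bytes
    total_bytes = df.memory_usage().sum().compute()   # total in-memory size of the data
    npartitions = max(1, int(total_bytes // target))

    df = df.repartition(npartitions=npartitions)
    df = df.persist()   # keep the repartitioned data in RAM so later work reuses it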
As of Dask 2.0.0 you may call `.repartition(partition_size="100MB")`.
This method performs an object-considerate (`.memory_usage(deep=True)`) breakdown of partition size. It will join smaller partitions, or split partitions that have grown too large.
Dask's Documentation also outlines the usage.
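A minimal sketch of that call, assuming Dask >= 2.0.0 and an illustrative CSV input:

    import dask.dataframe as dd

    df = dd.read_csv("data-*.csv")               # hypothetical input
    df = df.repartition(partition_size="100MB")  # merge/split partitions to roughly 100MB each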
Just to add to Samantha Hughes' answer:
`memory_usage()` by default ignores memory consumption of object dtype columns. For the datasets I have been working with recently, this leads to an underestimate of memory usage of about 10x.
Unless you are sure there are no object dtype columns, I would suggest specifying `deep=True`, that is, repartition using:
    df.repartition(npartitions=1 + df.memory_usage(deep=True).sum().compute() // n)
Where `n` is your target partition size in bytes. Adding 1 ensures the number of partitions is always at least 1 (`//` performs floor division, which would otherwise give 0 when the data is smaller than `n`).
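For instance, with a hypothetical 100MB target (the `int()` cast simply keeps `npartitions` a plain Python integer):

    n = 100_000_000   # target partition size in bytes (~100MB)
    npartitions = int(1 + df.memory_usage(deep=True).sum().compute() // n)
    df = df.repartition(npartitions=npartitions)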