Split a Python List into Chunks with Maximum Memory Size
The problem of optimally splitting a sequence so that its elements satisfy a given max/min condition while keeping the order of the elements can be solved greedily. Hence, you only need to iterate over the input sequence once and maintain a buffer of elements. In Python, this can be elegantly coded with a generator, which has the advantage of not needing to materialize the whole result.
The bulk of the algorithm for your problem is as follows:
```python
def split_by_size(items, max_size, get_size=len):
    buffer = []
    buffer_size = 0
    for item in items:
        item_size = get_size(item)
        if buffer_size + item_size <= max_size:
            buffer.append(item)
            buffer_size += item_size
        else:
            yield buffer
            buffer = [item]
            buffer_size = item_size
    if buffer_size > 0:
        yield buffer
```
where the last parameter delegates the issue of determining the size of a given item to the specified callable. I will not dwell on this, but I will assume that a simple `len()` will do. Also, this assumes that each element would individually satisfy the condition; otherwise, this case should be handled as well.
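For instance, one minimal way to handle that case is to fail fast on any item that can never fit (the strict variant below is illustrative, not part of the code above):

```python
def split_by_size_strict(items, max_size, get_size=len):
    buffer = []
    buffer_size = 0
    for item in items:
        item_size = get_size(item)
        # An item larger than max_size can never fit in any chunk,
        # so raise instead of emitting a chunk that violates the limit.
        if item_size > max_size:
            raise ValueError(
                f'item of size {item_size} exceeds max_size {max_size}')
        if buffer_size + item_size <= max_size:
            buffer.append(item)
            buffer_size += item_size
        else:
            yield buffer
            buffer = [item]
            buffer_size = item_size
    if buffer_size > 0:
        yield buffer
```

Alternatively, oversized items could be yielded as singleton chunks, depending on what the downstream consumer tolerates.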
Testing the above code:
```python
import random

k = 10
n = 15
max_size = 10

random.seed(0)
items = [b'x' * random.randint(1, 2 * k // 3) for _ in range(n)]

print(items)
# [b'xxxx', b'xxxx', b'x', b'xxx', b'xxxxx', b'xxxx', b'xxxx', b'xxx', b'xxxx', b'xxx', b'xxxxx', b'xx', b'xxxxx', b'xx', b'xxx']
print(list(split_by_size(items, k)))
# [[b'xxxx', b'xxxx', b'x'], [b'xxx', b'xxxxx'], [b'xxxx', b'xxxx'], [b'xxx', b'xxxx', b'xxx'], [b'xxxxx', b'xx'], [b'xxxxx', b'xx', b'xxx']]
```
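To illustrate the `get_size` parameter, here is a sketch that caps chunks by the UTF-8 byte size of strings rather than by character count (`utf8_size` is an illustrative helper, and the generator is repeated so the snippet runs on its own):

```python
def split_by_size(items, max_size, get_size=len):
    buffer = []
    buffer_size = 0
    for item in items:
        item_size = get_size(item)
        if buffer_size + item_size <= max_size:
            buffer.append(item)
            buffer_size += item_size
        else:
            yield buffer
            buffer = [item]
            buffer_size = item_size
    if buffer_size > 0:
        yield buffer


def utf8_size(s):
    # Accented characters take more than one byte in UTF-8,
    # so this can differ from len(s).
    return len(s.encode('utf8'))


words = ['héllo', 'wörld', 'abc']
print(list(split_by_size(words, 12, get_size=utf8_size)))
# [['héllo', 'wörld'], ['abc']]
```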
Also, if you are willing to store the result of the split in a `list` anyway, the code for the above approach can be made slightly more compact:
```python
def chunks_by_size(items, max_size, get_size=len):
    result = []
    # Start above max_size so that the first item opens a new chunk.
    size = max_size + 1
    for item in items:
        item_size = get_size(item)
        size += item_size
        if size > max_size:
            result.append([])
            size = item_size
        result[-1].append(item)
    return result
```
but also slightly slower (see benchmarks below).
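As a quick sanity check (both definitions repeated so the snippet runs on its own), the two variants should agree on the same input:

```python
import random


def split_by_size(items, max_size, get_size=len):
    buffer = []
    buffer_size = 0
    for item in items:
        item_size = get_size(item)
        if buffer_size + item_size <= max_size:
            buffer.append(item)
            buffer_size += item_size
        else:
            yield buffer
            buffer = [item]
            buffer_size = item_size
    if buffer_size > 0:
        yield buffer


def chunks_by_size(items, max_size, get_size=len):
    result = []
    size = max_size + 1
    for item in items:
        item_size = get_size(item)
        size += item_size
        if size > max_size:
            result.append([])
            size = item_size
        result[-1].append(item)
    return result


random.seed(0)
items = [b'x' * random.randint(1, 6) for _ in range(100)]
assert chunks_by_size(items, 10) == list(split_by_size(items, 10))
```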
You could also think of using `functools.reduce()` (basically the same as @NizamMohamed's answer), and the code will be shorter, but perhaps also less readable:
```python
import functools


def chunks_by_size_reduce(items, size, get_size=len):
    return functools.reduce(
        lambda a, b, size=size:
            a[-1].append(b) or a
            if a and sum(get_size(x) for x in a[-1]) + get_size(b) <= size
            else a.append([b]) or a,
        items, [])
```
and certainly less efficient, since `get_size()` is called for every element of the "candidate" inner list for every element considered, which makes this O(nk), with k being the average number of elements in each sub-sequence. For some timings, see the benchmarks below.
I would not be surprised if there were a solution using `itertools.accumulate()`, but that would also be bound to be quite slow.
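For completeness, here is one such sketch (my own illustration of the idea, not a tuned implementation); it assumes positive item sizes and takes two passes over the input:

```python
import itertools


def split_by_size_accumulate(items, max_size, get_size=len):
    items = list(items)
    sizes = [get_size(item) for item in items]
    # Running size within the current chunk; it resets to the item's own
    # size whenever adding the item would overflow max_size.
    running = list(itertools.accumulate(
        sizes,
        lambda total, size: size if total + size > max_size else total + size))
    chunks = []
    for i, item in enumerate(items):
        # A chunk starts at the first item and wherever the total was reset
        # (with positive sizes, exactly where the running sum "skips").
        if i == 0 or running[i] != running[i - 1] + sizes[i]:
            chunks.append([])
        chunks[-1].append(item)
    return chunks
```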
The simplest approach to speeding things up would be to use Cython or Numba. Here, this was applied to `split_by_size()`. For both of them, the code is unchanged.
Benchmarking all of this, we obtain the following (`_cy` stands for the Cython-compiled version, while `_nb` stands for the Numba-compiled version):

```python
%timeit list(split_by_size(items * 100000, k + 1))
# 10 loops, best of 3: 281 ms per loop
%timeit list(split_by_size_cy(items * 100000, k + 1))
# 10 loops, best of 3: 181 ms per loop
%timeit list(split_by_size_nb(items * 100000, k + 1))
# 100 loops, best of 3: 5.17 ms per loop
%timeit chunks_by_size(items * 100000, k + 1)
# 10 loops, best of 3: 318 ms per loop
%timeit chunks_by_size_reduce(items * 100000, k + 1)
# 1 loop, best of 3: 1.18 s per loop
```
Note that while the Numba-compiled version is much faster than the alternatives, it is also the most brittle, since it requires the `forceobj` flag to be set to `True`, and this may lead to unstable execution.
Anyway, I hardly believe this would be a bottleneck if the final goal is to send the grouped items through some I/O operation.
Note that the algorithm is pretty much the same as in the other answers; I just find the code here a bit cleaner.
This solution uses `functools.reduce()`.
```python
from functools import reduce

l = [b'abc', b'def', b'ghi', b'jklm', b'nopqrstuv', b'wx', b'yz']
reduce(lambda a, b, size=7: a[-1].append(b) or a if a and sum(len(x) for x in a[-1]) + len(b) <= size else a.append([b]) or a, l, [])
```
`a` is the accumulator, starting as an empty `list`, and `b` is an item from the original `list`.

`if a and sum(len(x) for x in a[-1]) + len(b) <= size` checks that `a` is not empty and that the total length of the `bytes` in the last appended `list`, plus the length of `b`, does not exceed `size`.

`a[-1].append(b) or a` appends `b` to the last appended `list` of `a` and returns `a`, if the condition is `True`.

`a.append([b]) or a` makes a new `list` with `b`, appends the new `list` to `a`, and returns `a`.
Output:

```python
[[b'abc', b'def'], [b'ghi', b'jklm'], [b'nopqrstuv'], [b'wx', b'yz']]
```
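The lambda relies on the fact that `list.append()` returns `None`, which is falsy, so `x.append(y) or x` appends and then evaluates to `x`:

```python
# list.append() returns None, so the `or` falls through to the list itself,
# letting a single expression both mutate and return the accumulator.
a = [1, 2]
result = a.append(3) or a
print(result)
# [1, 2, 3]
```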