Split a Python List into Chunks with Maximum Memory Size
The problem of optimally splitting a sequence so that its elements satisfy a given max/min condition while keeping the order of the elements can be solved greedily. Hence, you only need to iterate over the input sequence once and maintain a buffer of elements. In Python, this can be elegantly coded with a generator, which has the advantage of not needing to materialize the whole result.
The bulk of the algorithm for your problem is as follows:
```python
def split_by_size(items, max_size, get_size=len):
    buffer = []
    buffer_size = 0
    for item in items:
        item_size = get_size(item)
        if buffer_size + item_size <= max_size:
            buffer.append(item)
            buffer_size += item_size
        else:
            yield buffer
            buffer = [item]
            buffer_size = item_size
    if buffer_size > 0:
        yield buffer
```
where the last parameter delegates the issue of determining the size of a given item to the specified callable. I will not dwell on this, but I will assume that a simple `len()` will do. Also, this assumes that each element would individually satisfy the condition; otherwise, this case should be handled as well.
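For instance, one minimal way to handle that case is to fail fast on any item that can never fit (the strict variant below is illustrative, not part of the code above):

```python
def split_by_size_strict(items, max_size, get_size=len):
    buffer = []
    buffer_size = 0
    for item in items:
        item_size = get_size(item)
        # An item larger than max_size can never fit in any chunk,
        # so raise instead of emitting a chunk that violates the limit.
        if item_size > max_size:
            raise ValueError(
                f'item of size {item_size} exceeds max_size {max_size}')
        if buffer_size + item_size <= max_size:
            buffer.append(item)
            buffer_size += item_size
        else:
            yield buffer
            buffer = [item]
            buffer_size = item_size
    if buffer_size > 0:
        yield buffer
```

Alternatively, oversized items could be yielded as singleton chunks, depending on what the downstream consumer tolerates.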
Testing the above code:
```python
import random

k = 10
n = 15
max_size = 10

random.seed(0)
items = [b'x' * random.randint(1, 2 * k // 3) for _ in range(n)]

print(items)
# [b'xxxx', b'xxxx', b'x', b'xxx', b'xxxxx', b'xxxx', b'xxxx', b'xxx', b'xxxx', b'xxx', b'xxxxx', b'xx', b'xxxxx', b'xx', b'xxx']
print(list(split_by_size(items, k)))
# [[b'xxxx', b'xxxx', b'x'], [b'xxx', b'xxxxx'], [b'xxxx', b'xxxx'], [b'xxx', b'xxxx', b'xxx'], [b'xxxxx', b'xx'], [b'xxxxx', b'xx', b'xxx']]
```
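To illustrate the `get_size` parameter, here is a sketch that caps chunks by the UTF-8 byte size of strings rather than by character count (`utf8_size` is an illustrative helper, and the generator is repeated so the snippet runs on its own):

```python
def split_by_size(items, max_size, get_size=len):
    buffer = []
    buffer_size = 0
    for item in items:
        item_size = get_size(item)
        if buffer_size + item_size <= max_size:
            buffer.append(item)
            buffer_size += item_size
        else:
            yield buffer
            buffer = [item]
            buffer_size = item_size
    if buffer_size > 0:
        yield buffer


def utf8_size(s):
    # Accented characters take more than one byte in UTF-8,
    # so this can differ from len(s).
    return len(s.encode('utf8'))


words = ['héllo', 'wörld', 'abc']
print(list(split_by_size(words, 12, get_size=utf8_size)))
# [['héllo', 'wörld'], ['abc']]
```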
Also, if you are willing to store the result of the split in a `list` anyway, the code for the above approach can be made slightly more compact:
```python
def chunks_by_size(items, max_size, get_size=len):
    result = []
    # Start above max_size so that the first item opens a new chunk.
    size = max_size + 1
    for item in items:
        item_size = get_size(item)
        size += item_size
        if size > max_size:
            result.append([])
            size = item_size
        result[-1].append(item)
    return result
```
but also slightly slower (see benchmarks below).
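As a quick sanity check (both definitions repeated so the snippet runs on its own), the two variants should agree on the same input:

```python
import random


def split_by_size(items, max_size, get_size=len):
    buffer = []
    buffer_size = 0
    for item in items:
        item_size = get_size(item)
        if buffer_size + item_size <= max_size:
            buffer.append(item)
            buffer_size += item_size
        else:
            yield buffer
            buffer = [item]
            buffer_size = item_size
    if buffer_size > 0:
        yield buffer


def chunks_by_size(items, max_size, get_size=len):
    result = []
    size = max_size + 1
    for item in items:
        item_size = get_size(item)
        size += item_size
        if size > max_size:
            result.append([])
            size = item_size
        result[-1].append(item)
    return result


random.seed(0)
items = [b'x' * random.randint(1, 6) for _ in range(100)]
assert chunks_by_size(items, 10) == list(split_by_size(items, 10))
```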
You could also think of using `functools.reduce()` (basically the same as @NizamMohamed's answer), and the code will be shorter, but perhaps also less readable:
```python
import functools


def chunks_by_size_reduce(items, size, get_size=len):
    return functools.reduce(
        lambda a, b, size=size:
            a[-1].append(b) or a
            if a and sum(get_size(x) for x in a[-1]) + get_size(b) <= size
            else a.append([b]) or a,
        items, [])
```
and certainly less efficient, since `get_size()` is called for every element of the "candidate" inner list for every element considered, which makes this O(nk), with k being the average number of elements in each sub-sequence. For some timings, see the benchmarks below.
I would not be surprised if there were a solution using `itertools.accumulate()`, but that would also be bound to be quite slow.
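For completeness, here is one such sketch (my own illustration of the idea, not a tuned implementation); it assumes positive item sizes and takes two passes over the input:

```python
import itertools


def split_by_size_accumulate(items, max_size, get_size=len):
    items = list(items)
    sizes = [get_size(item) for item in items]
    # Running size within the current chunk; it resets to the item's own
    # size whenever adding the item would overflow max_size.
    running = list(itertools.accumulate(
        sizes,
        lambda total, size: size if total + size > max_size else total + size))
    chunks = []
    for i, item in enumerate(items):
        # A chunk starts at the first item and wherever the total was reset
        # (with positive sizes, exactly where the running sum "skips").
        if i == 0 or running[i] != running[i - 1] + sizes[i]:
            chunks.append([])
        chunks[-1].append(item)
    return chunks
```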
The simplest approach to speeding things up would be to use Cython or Numba. Here, this was applied to `split_by_size()`. For both of them, the code is unchanged.
Benchmarking all of this, we obtain the following (`_cy` stands for the Cython-compiled version, while `_nb` stands for the Numba-compiled version):

```python
%timeit list(split_by_size(items * 100000, k + 1))
# 10 loops, best of 3: 281 ms per loop
%timeit list(split_by_size_cy(items * 100000, k + 1))
# 10 loops, best of 3: 181 ms per loop
%timeit list(split_by_size_nb(items * 100000, k + 1))
# 100 loops, best of 3: 5.17 ms per loop
%timeit chunks_by_size(items * 100000, k + 1)
# 10 loops, best of 3: 318 ms per loop
%timeit chunks_by_size_reduce(items * 100000, k + 1)
# 1 loop, best of 3: 1.18 s per loop
```
Note that while the Numba-compiled version is much faster than the alternatives, it is also the most brittle, since it requires the `forceobj` flag to be set to `True`, and this may lead to unstable execution.
Anyway, I hardly believe this would be a bottleneck if the final goal is to send the grouped items through some I/O operation.
Note that the algorithm is pretty much the same as in the other answers; I just find the code here a bit cleaner.
This solution uses `functools.reduce()`.
```python
from functools import reduce

l = [b'abc', b'def', b'ghi', b'jklm', b'nopqrstuv', b'wx', b'yz']
reduce(lambda a, b, size=7: a[-1].append(b) or a if a and sum(len(x) for x in a[-1]) + len(b) <= size else a.append([b]) or a, l, [])
```
`a` is the accumulator, starting as an empty `list`, and `b` is an item from the original `list`.

`if a and sum(len(x) for x in a[-1]) + len(b) <= size` checks that `a` is not empty and that the total length of the `bytes` in the last appended `list`, plus the length of `b`, does not exceed `size`.

`a[-1].append(b) or a` appends `b` to the last appended `list` of `a` and returns `a`, if the condition is `True`.

`a.append([b]) or a` makes a new `list` with `b`, appends the new `list` to `a`, and returns `a`.
Output:

```python
[[b'abc', b'def'], [b'ghi', b'jklm'], [b'nopqrstuv'], [b'wx', b'yz']]
```
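The lambda relies on the fact that `list.append()` returns `None`, which is falsy, so `x.append(y) or x` appends and then evaluates to `x`:

```python
# list.append() returns None, so the `or` falls through to the list itself,
# letting a single expression both mutate and return the accumulator.
a = [1, 2]
result = a.append(3) or a
print(result)
# [1, 2, 3]
```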