Best way to process database in chunks with Django QuerySet?
From your description you don't actually care about the sort order of the rows you process. If you have primary keys in your tables (which I expect!), this crude method of partitioning would be much faster:
SELECT * FROM tbl WHERE id BETWEEN 0 AND 1000;
SELECT * FROM tbl WHERE id BETWEEN 1001 AND 2000;
...
This performs the same for any offset and (almost) the same for any size of table. Retrieve min and max of your primary key and partition accordingly:
SELECT min(id), max(id) from tbl; -- then divide in suitable chunks
As opposed to:
SELECT * FROM tbl ORDER BY id LIMIT 1000;
SELECT * FROM tbl ORDER BY id LIMIT 1000 OFFSET 1000;
...
This is generally slower because all rows have to be sorted and performance degrades additionally with higher offsets and bigger tables.
The following code implements Erwin's answer above (using BETWEEN
) for a Django QuerySet:
A utility function that will do this for an arbitrary Django QuerySet is as follows. It defaults to assuming 'id' is a suitable field to use for the between
clause.
def chunked_queryset(qs, batch_size, index='id'):
"""
Yields a queryset split into batches of maximum size 'batch_size'.
Any ordering on the queryset is discarded.
"""
qs = qs.order_by() # clear ordering
min_max = qs.aggregate(min=models.Min(index), max=models.Max(index))
min_id, max_id = min_max['min'], min_max['max']
for i in range(min_id, max_id + 1, batch_size):
filter_args = {'{0}__range'.format(index): (i, i + batch_size - 1)}
yield qs.filter(**filter_args)
It would be used like this:
for chunk in chunked_queryset(SomeModel.objects.all(), 20):
# `chunk` is a queryset
for item in chunk:
# `item` is a SomeModel instance
pass
You could also change the interface so that you didn't need the extra nested loop, but could do for item in chunked_queryset(qs)
:
def chunked_queryset(qs, batch_size, index='id'):
"""
Yields a queryset that will be evaluated in batches
"""
qs = qs.order_by() # clear ordering
min_max = qs.aggregate(min=models.Min(index), max=models.Max(index))
min_id, max_id = min_max['min'], min_max['max']
for i in range(min_id, max_id + 1, batch_size):
filter_args = {'{0}__range'.format(index): (i, i + batch_size - 1)}
for item in qs.filter(**filter_args):
yield item