Limiting Memory Use in a *Large* Django QuerySet
So what I actually ended up doing is building something that you can 'wrap' a QuerySet in. It works by making a deepcopy of the QuerySet, using the slice syntax--e.g., some_queryset[15:45]
--but then it makes another deepcopy of the original QuerySet when the slice has been completely iterated through. This means that only the set of Objects returned in 'this' particular slice are stored in memory.
class MemorySavingQuerysetIterator(object):
def __init__(self,queryset,max_obj_num=1000):
self._base_queryset = queryset
self._generator = self._setup()
self.max_obj_num = max_obj_num
def _setup(self):
for i in xrange(0,self._base_queryset.count(),self.max_obj_num):
# By making a copy of of the queryset and using that to actually access
# the objects we ensure that there are only `max_obj_num` objects in
# memory at any given time
smaller_queryset = copy.deepcopy(self._base_queryset)[i:i+self.max_obj_num]
logger.debug('Grabbing next %s objects from DB' % self.max_obj_num)
for obj in smaller_queryset.iterator():
yield obj
def __iter__(self):
return self
def next(self):
return self._generator.next()
So instead of...
for obj in SomeObject.objects.filter(foo='bar'): <-- Something that returns *a lot* of Objects
do_something(obj);
You would do...
for obj in MemorySavingQuerysetIterator(in SomeObject.objects.filter(foo='bar')):
do_something(obj);
Please note that the intention of this is to save memory in your Python interpreter. It essentially does this by making more database queries. Usually people are trying to do the exact opposite of that--i.e., minimize database queries as much as possible without regards to memory usage. Hopefully somebody will find this useful though.
What about using django core's Paginator and Page objects documented here:
https://docs.djangoproject.com/en/dev/topics/pagination/
Something like this:
from django.core.paginator import Paginator
from djangoapp.models import SomeModel
paginator = Paginator(SomeModel.objects.all(), 1000) # chunks of 1000
for page_idx in range(1, paginator.num_pages):
for row in paginator.page(page_idx).object_list:
# here you can do what you want with the row
print "done processing page %s" % page_idx
You can't use Model.objects.all().iterator()
because it will fetch all the elements on your table at once. Neither can you use Model.objects.all()[offset:offset+pagesize]
, because it will cache the results. Either will exceed your memory limit.
I've tried to mix both solutions, and it worked:
offset = 0
pagesize = 1000
count = Model.objects.all().count()
while offset < count:
for m in Model.objects.all()[offset : offset + pagesize].iterator:
do_something with m
offset += pagesize
Change pagesize
to fit your requirements, and optionally change the [offset : offset + pagesize]
to the [offset * pagesize : (offset + 1) * pagesize]
idiom if it fits you better. Also, of course, replace Model
by your actual model name.