fastest way to create JSON to reflect a tree structure in Python / Django using mptt
I suspect by far the biggest slowdown is that this will do 1 database query per node. The json rendering is trivial in comparison to the hundreds of round-trips to your database.
You should cache the children on each node so that those queries can be done all at once. django-mptt has a cache_tree_children() function you can do this with.
import json
from mptt.templatetags.mptt_tags import cache_tree_children
def recursive_node_to_dict(node):
result = {
'id': node.pk,
'name': node.name,
}
children = [recursive_node_to_dict(c) for c in node.get_children()]
if children:
result['children'] = children
return result
root_nodes = cache_tree_children(Node.objects.all())
dicts = []
for n in root_nodes:
dicts.append(recursive_node_to_dict(n))
print json.dumps(dicts, indent=4)
Custom json encoding, while it might provide a slight speedup in some scenarios, is something I'd highly discourage, as it will be a lot of code, and it's something that's easy to get very wrong.
Your updated version looks like there would be very little overhead. I think it would be slightly more efficient (and more readable, too!) to use a list comprehension:
def serializable_object(node):
"Recurse into tree to build a serializable object"
obj = {
'name': node.name,
'children': [serializable_object(ch) for ch in node.get_children()]
}
return obj
Besides that, all you can do is profile it to find the bottlenecks. Write some standalone code that loads and serializes your 300 nodes and then run it with
python -m profile serialize_benchmark.py
(or -m cProfile
if that works better).
The can see 3 different potential bottlenecks:
- DB access (
.get_children()
and.name
) -- I'm not sure exactly what's going on under the hood, but I've had code like this that does a DB query for each node adding a tremendous overhead. If that's your problem, you can probably configure this to do an "eager load" using select_related or something similar. - function call overhead (e.g.
serializable_object
itself) -- Just make sure ncalls forserializable_object
looks like a reasonable number. If I understand your description, it should be in the neighborhood of 300. - serializing at the end (
json.dumps(nodeInstance)
) -- Not a likely culprit since you said it's only 300 nodes, but if you do see this taking up a lot of execution time, make sure you have the compiled speedups for JSON working properly.
If you can't tell much from profiling it, make a stripped-down version that, say, recursively calls node.name
and node.get_children()
but doesn't store the results in a data structure, and see how that compares.
Update:
There are 2192 calls to execute_sql
in solution 3 and 2192 in solution 5, so I think that excessive DB queries is a problem and that select_related
did nothing the way it's used above. Looking at django-mptt issue #88: Allow select_related in model methods suggests that you're using it more-or-less right, but I have my doubt, and get_children
vs. get_descendants
might make a huge difference.
There's also a ton of time being taken up by copy.deepcopy
, which is puzzling because you're not directly calling it, and I don't see it called from the MPTT code. What's tree.py?
If you're doing a lot of work with profiling, I'd highly recommend the really slick tool RunSnakeRun, which lets you see your profile data in a really convenient grid form and make sense of the data more quickly.
Anyway, here's one more attempt at streamlining the DB side of things:
import weakref
obj_cache = weakref.WeakValueDictionary()
def serializable_object(node):
root_obj = {'name': node.get_wbs_code(), 'wbsCode': node.get_wbs_code(),
'id': node.pk, 'level': node.level, 'position': node.position,
'children': []}
obj_cache[node.pk] = root_obj
# don't know if the following .select_related() does anything...
for descendant in node.get_descendants().select_related():
# get_descendants supposedly traverses in "tree order", which I think
# means the parent obj will always be created already
parent_obj = obj_cache[descendant.parent.pk] # hope parent is cached
descendant_obj = {'name': descendant.get_wbs_code(),
'wbsCode': descendant.get_wbs_code(), 'id': descendant.pk,
'level': descendant.level, 'position': descendant.position,
'children': []}
parent_obj['children'].append(descendant_obj)
obj_cache[descendant.pk] = descendant_obj
return root_obj
Note this is no longer recursive. It proceeds iteratively through nodes, theoretically after their parents have been visited, and it's all using one big call to MPTTModel.get_descendants()
, so hopefully that's well-optimized and caches .parent
, etc. (or maybe there's a more direct way to get at the parent key?). It creates each obj with no children initially, then "grafts" all the values to their parents afterwards.