Why is memory allocation on heap MUCH slower than on stack?

In your edit where you restate unwind's answer you mention the "heap data structure". Be very careful as the data structure known as a heap has no relationship to dynamic memory allocation. To be very clear, I'll use the more language lawyer terminology of free store.

As has already been pointed out, stack allocation requires incrementing a pointer, which typically has a dedicated register on most architectures and deallocation requires the same amount of work. Stack allocations are also scoped to a particular function. This makes them much better candidates for compiler optimizations like precomputing the total space needed on the stack and doing a single increment for an entire stack frame. Likewise, the stack has better guaranteed data locality. The top of the stack is almost always guaranteed to be inside a cache line, and as I mentioned already the stack pointer is typically stored in a register. Optimizing compilers on some architectures can even eliminate allocations altogether on the stack by reusing arguments from previous stack frames that are passed as arguments to called functions in deeper stack frames. Likewise stack allocated variables can often be promoted to registers avoiding allocations as well.

In contrast, the free store is much more complex. I'm not even going to begin to cover garbage collection systems as that's a whole different topic, and this question was asked about the C language. Typically allocations and deallocations from a free store involve several different data structures like a free list or block pool. These data structures and bookkeeping require memory as well, and thus that space is wasted. Furthermore, the bookkeeping records are often intermixed with the allocations and thus hurt the data locality of other allocations. Allocations from the free store may involve asking the underlying operating system for more process memory typically from some form of slab allocator.

For a simple comparison, and using jemalloc-2.2.5 and numbers from sloccount as a reference, the jemalloc implementation contains over 8,800 lines of source code in the C language and another over 700 lines of test code. This should give you a good idea of the difference in complexity between free store allocation and stack allocation: thousands of lines of C code versus a single instruction.

Additionally, since free store allocations are not limited to a single lexical scope, the lifetime of every allocation must be tracked. Likewise, these allocations may be passed across threads, and thus thread synchronization issues enter the problem space. Another big problem for free store allocation is fragmentation. Fragmentation causes many problems:

  • Fragmentation hurts data locality.
  • Fragmentation wastes memory.
  • Fragmentation makes the job of finding free space for large allocations harder.

On modern systems, stacks are often relatively small in comparison to the free store, so ultimately the free store is managing more space and thus tackling a harder problem. Also, due to the limitations on stack sizes, the free store is typically used for larger allocations, this discrepancy between having to handle both very large and very small allocations makes the job of the free store harder as well. Typically stack allocations are small on the order of a few kilobytes or less, and the total size of the stack is only a few megabytes. The free store is generally given the entire rest of process space in a program. On modern machines this can be several hundred gigabytes, and it's not uncommon for free store allocations to vary in size from a few bytes like a short string of characters to megabytes or even gigabytes of arbitrary data. This means that free store allocators have to deal with the underlying operating system's virtual memory management. Stack allocation is essentially built-in to the computer hardware.

If you want to really learn about free store allocation, I would highly recommend reading some of the many papers and articles published about various malloc implementations or even reading the code. Here's a few links to get you started:

  • dlmalloc - Doug Lea's malloc, a historical reference malloc implementation used in GNU C++ at one point in time
  • phkmalloc - FreeBSD implementation of malloc written by Poul-Henning Kamp author of the Varnish web cache
  • tcmalloc - Thread-Caching Malloc implemented by some Google developers
  • jemalloc - Jason Evan's malloc implementation for FreeBSD (successor of phkmalloc)

Here's some additional links with descriptions of the tcmalloc implementation:

  • http://jamesgolick.com/2013/5/15/memory-allocators-101.html
  • http://jamesgolick.com/2013/5/19/how-tcmalloc-works.html

The main difference between a stack and a heap is that items on a stack cannot be removed out of order. If you add items A, B, C to a stack, you can't remove B without removing C first. This means that adding a new item to a stack always means adding it to the end of the stack, which is a very simple operation. You just move the pointer that points to the end of the stack.

On a heap on the other hand, you can remove items out of order. And as long as you don't move the other items around afterwards in memory (as some garbage collected heaps do), your heap then has "hole" in the middle. I.e. if you add A,B,C to a heap and remove B, your heap looks like this in memory: A _ C where _ is a block of unused (free) memory. If you add a new item D now, the allocator has to find a continuous free space big enough to fit D. Depending on how many continuous free spaces there are in your memory, this can be an expensive operation. And it's almost always more expensive than just moving the "last element" pointer of a stack.


Because the heap is a far more complicated data structure than the stack.

For many architectures, allocating memory on the stack is just a matter of changing the stack pointer, i.e. it's one instruction. Allocating memory on the heap involves looking for a big enough block, splitting it, and managing the "book-keeping" that allows things like free() in a different order.

Memory allocated on the stack is guaranteed to be deallocated when the scope (typically the function) exits, and it's not possible to deallocate just some of it.