Why protobuf is smaller in memory than normal dict+list in python?
"Simple" python objects, such as int
or float
, need much more memory than their C-counterparts used by protobuf
.
Let's take a list
of Python integers as example compared to an array of integers, as for example in an array.array
(i.e. array.array('i', ...)
).
The analysis for array.array
is simple: discarding some overhead from the array.arrays
-object, only 4 bytes (size of a C-integer) are needed per element.
The situation is completely different for a list of integers:
- the list holds not the integer-objects themselves but pointers to the objects (
8
additional bytes for a 64bit executable) - even a small non-zero integer needs at least
28
bytes (seeimport sys; sys.getsizeof(1)
returns 28): 8 bytes are needed for reference counting, 8 bytes to hold a pointer to the integer-function table, 8 bytes are needed for the size of the integer value (Python's integers can be much bigger than 2^32), and at least 4 byte to hold the integer value itself. - there is also an overhead for memory management of 4.5 bytes.
This means there is a whopping cost of 40.5 bytes per Python integer compared to the possible 4 bytes (or 8 bytes if we use long long int
, i.e. 64bit integers).
A situation is similar for a list with Python floats compared to an array of doubles
( i.e. array.array('d',...)
), which only needs about 8 bytes per element. But for list we have:
- the list holds not the float objects themselves but pointers to the objects (
8
additional bytes for a 64bit executable) - a float object needs
24
bytes (seeimport sys; sys.getsizeof(1.0)
returns 24): 8 bytes are needed for reference counting, 8 bytes to hold a pointer to the float-function table, and 8 bytes to hold thedouble
-value itself. - because 24 is a multiple of 8, the overhead for memory management is "only" about 0.5 bytes.
Which means 32.5 bytes for a Python float object vs. 8 byte for a C-double.
protobuf
uses internally the same representation of the data as array.array
and thus needs much less memory (about 4-5 times less, as you observe). numpy.array
is another example for a data type, which holds raw C-values and thus needs much less memory than lists.
If one doesn't need to search in a dictionary, then saving the key-values-pairs in a list will need less memory than in a dictionary, because one doesn't have to maintain a structure for searching (which imposes some memory costs) - this is also another thing that leads to smaller memory footprint of protobuf
-data.
To answer your other question: There are no built-in modules which are to Python-dict
, what array.array
are to Python-list
, so I use this opportunity to shamelessly plug-in an advertisement for a library of mine: cykhash
.
Sets and maps from cykhash
need less than 25% of Python'S-dict
/set
memory but are about the same fast.