Why does an empty string in Python sometimes take up 49 bytes and sometimes 51?
https://docs.python.org/3.5/library/sys.html#sys.getsizeof
The sys module is system-specific, so its results can easily differ between platforms. This is often overlooked. System-specific functionality in Python has been collected in the sys module for years. For example, sys.getwindowsversion() is not portable by definition, but it's there. It's like the bottomless pit of rejects in the perfect world of cross-platform coding. What you see is one of the interesting nuggets of Python.
From the getsizeof docs:
Only the memory consumption directly attributed to the object is accounted for, not the memory consumption of objects it refers to.
getsizeof() calls the object's __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.
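As a rough illustration of the quoted behaviour, the sketch below compares sys.getsizeof() with __sizeof__() for a str and for a list. It assumes CPython (other implementations may not support getsizeof at all), and gc_overhead is a made-up helper name for this example only.

```python
import gc
import sys

def gc_overhead(obj):
    """Bytes that sys.getsizeof() adds on top of obj.__sizeof__()."""
    return sys.getsizeof(obj) - obj.__sizeof__()

empty_str = ""
empty_list = []

# A str cannot contain references, so the collector does not track it.
print("str :", gc.is_tracked(empty_str), gc_overhead(empty_str))
# A list is a container, so it carries a garbage collector header.
print("list:", gc.is_tracked(empty_list), gc_overhead(empty_list))
```

On CPython the str line reports no added overhead, so the garbage collector header does not account for a difference between two measurements of the same string.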
When an object is managed by the garbage collector, getsizeof() adds that extra overhead to the reported size. The Q&A "When are objects garbage collected in python?" goes into great detail about the GC and how it affects memory and reference counts.
I hope that explains where this is coming from. If you avoid system-level attributes and stick to more Pythonic ones, you will get consistent sizes.
This sounds like something is accessing the deprecated Py_UNICODE API.
As of CPython 3.7, the way the CPython Unicode representation works out, an empty string is normally stored in "compact ASCII" representation, and the base data and padding for a compact ASCII string on a 64-bit build works out to 48 bytes, plus one byte of string data (just the null terminator). You can see the relevant header file here.
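A small measurement can make the "fixed part plus one byte per character" layout visible. This is only a sketch, assuming CPython and pure-ASCII strings; the absolute numbers depend on the interpreter version and build (the 49 above is for 64-bit CPython 3.7), so only the differences are meaningful here.

```python
import sys

empty = sys.getsizeof("")
one_char = sys.getsizeof("a")
ten_chars = sys.getsizeof("a" * 10)

# Fixed part: object header, padding, and the null terminator.
print("empty string:", empty)
# Compact ASCII storage adds one byte per character.
print("growth for 1 character:", one_char - empty)
print("growth for 10 characters:", ten_chars - empty)
```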
For now (this is scheduled for removal in 3.12), there is also a deprecated Py_UNICODE API that stores an auxiliary wchar_t representation of the string. On a platform with 2-byte wchar_t, the wchar_t representation of an empty string is 2 bytes (just the null terminator again). The Py_UNICODE API caches this representation on the string object on first access, and str.__sizeof__ accounts for this extra data when it exists, resulting in a 51-byte total (48 + 1 + 2).
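The arithmetic can be written out as a tiny model. This is a sketch rather than CPython's actual code: predicted_sizeof and ASCII_BASE are hypothetical names, the 48-byte base is the figure quoted above for a 64-bit CPython 3.7 build, and extending the formula to non-empty strings (one wchar_t per character plus a terminator) is an assumption.

```python
import ctypes

# Base data and padding for a compact ASCII string, as quoted in the text
# for 64-bit CPython 3.7. Other versions and builds use different values.
ASCII_BASE = 48

def predicted_sizeof(length, wchar_size=2, wchar_cached=False):
    """Model the size of an ASCII str, with or without a cached wchar_t copy."""
    size = ASCII_BASE + length + 1  # characters plus the null terminator
    if wchar_cached:
        # Assumed layout of the cached copy: characters plus a terminator.
        size += (length + 1) * wchar_size
    return size

print(predicted_sizeof(0))                     # 49
print(predicted_sizeof(0, wchar_cached=True))  # 51

# Size of wchar_t on the current platform, which decides the cached copy's cost.
print(ctypes.sizeof(ctypes.c_wchar))
```

Under this model a platform with 4-byte wchar_t would report a different total for the cached case, which fits the 51-byte figure appearing only where wchar_t is 2 bytes.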
(If you need a wchar_t representation of a string, the non-deprecated way to get one is to use PyUnicode_AsWideChar or PyUnicode_AsWideCharString. These functions are not scheduled for removal, and do not attach any data to the string object.)