Is a string internally stored as individual characters, with each character in memory shared by other similar strings?
How strings are stored is an implementation detail, but in practice, on the CPython reference interpreter, they're stored as a C-style array of characters. So if the R is at address x, then O is at x+1 (or +2 or +4, depending on the largest ordinal value in the string), and B is at x+2 (or +4 or +8). Because the letters are stored consecutively, knowing where R is (and a flag in the str that says how big each character's storage is) is enough to locate O and B.
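As a rough check of the per-character width described above, the sketch below compares sys.getsizeof for two lengths of the same repeated character, so the fixed object header cancels out. The helper name bytes_per_char is made up for illustration, and the result is specific to CPython 3.3 or later; other implementations may report something else.

```python
import sys


def bytes_per_char(ch):
    """Estimate storage per character for a string made only of `ch`.

    Hypothetical helper: subtracting the sizes of two lengths cancels
    the fixed per-object overhead, leaving only the character data.
    """
    return (sys.getsizeof(ch * 20) - sys.getsizeof(ch * 10)) // 10


ascii_width = bytes_per_char("R")            # largest ordinal below 128
latin1_width = bytes_per_char("\xe9")        # largest ordinal below 256
bmp_width = bytes_per_char("\u4e2d")         # largest ordinal below 65536
astral_width = bytes_per_char("\U0001F600")  # largest ordinal 65536 or above

print(ascii_width, latin1_width, bmp_width, astral_width)
```

On CPython 3.3+ this should print 1 1 2 4, matching the x+1, x+2, or x+4 stride between neighboring characters.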
'BOB' is at a completely different address, y, and its O and B are contiguous as well. The OB in 'ROB' is utterly unrelated to the OB in 'BOB'.
There is a confusing aspect to this. If you index into the strings and check the id of the result, it will seem like 'O' has the same address in both strings. But that's only because:
- Indexing into a string returns a new string, unrelated to the one being indexed, and
- CPython caches length-one strings in the latin-1 range, so 'O' is a singleton (no matter how you make it, you get back the cached string)
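The two points above can be seen with a small experiment. This is a sketch that relies on CPython's caching behavior, which is an implementation detail and not a language guarantee; the variable names are just for illustration.

```python
rob = "ROB"
bob = "BOB"

# The two strings are separate objects at separate addresses.
distinct_strings = rob is not bob

# Indexing produces a one-character str. For characters in the latin-1
# range, CPython hands back its cached singleton, so the identities match
# even though the source strings are unrelated.
same_o = rob[1] is bob[1]
same_o_via_chr = rob[1] is chr(79)  # chr(79) == "O"

print(distinct_strings, same_o, same_o_via_chr)

# Outside the latin-1 range there is no such cache, so two indexing
# operations are not guaranteed to return the same object.
cjk = "\u4e2d\u6587"
print(cjk[0] is cjk[0])
```

The matching identity of 'O' says nothing about how 'ROB' and 'BOB' store their characters internally; it only reflects which object indexing returns.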
I'll note that the actual str internals in modern Python are even more complicated than I covered above; a single string might store the same data in up to three different encodings in the same object (the canonical form, and cached version(s) for working with specific Python C APIs). It's not visible from the Python level aside from checking the size with sys.getsizeof though, so it's not worth worrying about in general.
If you really want to head off into the weeds, feel free to read PEP 393: Flexible String Representation, which elaborates on the internals of the new str object structure adopted in CPython 3.3.