Why doesn't the id of an immutable string change after the second += in Python 3?
This is only possible due to a weird, slightly-sketchy optimization for string concatenation in the bytecode evaluation loop. The INPLACE_ADD
implementation special-cases two string objects:
    case TARGET(INPLACE_ADD): {
        PyObject *right = POP();
        PyObject *left = TOP();
        PyObject *sum;
        if (PyUnicode_CheckExact(left) && PyUnicode_CheckExact(right)) {
            sum = unicode_concatenate(tstate, left, right, f, next_instr);
            /* unicode_concatenate consumed the ref to left */
        }
        else {
            ...
and calls a unicode_concatenate helper that delegates to PyUnicode_Append, which tries to mutate the original string in-place:
    void
    PyUnicode_Append(PyObject **p_left, PyObject *right)
    {
        ...
        if (unicode_modifiable(left)
            && PyUnicode_CheckExact(right)
            && PyUnicode_KIND(right) <= PyUnicode_KIND(left)
            /* Don't resize for ascii += latin1. Convert ascii to latin1 requires
               to change the structure size, but characters are stored just after
               the structure, and so it requires to move all characters which is
               not so different than duplicating the string. */
            && !(PyUnicode_IS_ASCII(left) && !PyUnicode_IS_ASCII(right)))
        {
            /* append inplace */
            if (unicode_resize(p_left, new_len) != 0)
                goto error;
            /* copy 'right' into the newly allocated area of 'left' */
            _PyUnicode_FastCopyCharacters(*p_left, left_len, right, 0, right_len);
        }
        ...
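The content-related part of that condition can be modelled in a few lines of Python. This is only a sketch: kind and content_allows_in_place are hypothetical helper names, not CPython APIs, and the model deliberately ignores unicode_modifiable, the exact-type check on the right operand, and anything hidden behind the ... in the excerpt. It assumes Python 3.7+ for str.isascii.

```python
def kind(s):
    """Bytes per character CPython would pick for s: 1, 2 or 4."""
    top = max(map(ord, s), default=0)
    if top < 0x100:
        return 1   # Latin-1 range, which includes ASCII
    if top < 0x10000:
        return 2   # rest of the Basic Multilingual Plane
    return 4       # everything above U+FFFF


def content_allows_in_place(left, right):
    """Model of the KIND and IS_ASCII tests in the PyUnicode_Append excerpt."""
    fits = kind(right) <= kind(left)
    # ASCII += non-ASCII is rejected even when both are 1-byte kinds.
    ascii_plus_non_ascii = left.isascii() and not right.isascii()
    return fits and not ascii_plus_non_ascii


print(content_allows_in_place("abc", "def"))        # True: ASCII += ASCII
print(content_allows_in_place("abc", "\u00e9"))     # False: ASCII += Latin-1
print(content_allows_in_place("\u00e9", "\u20ac"))  # False: right needs a wider kind
print(content_allows_in_place("\u20ac", "\u00e9"))  # True: right fits in left's kind
```

A True result here only means the contents do not rule the optimization out; the reference-count and hash conditions still have to hold.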
The optimization only happens if unicode_concatenate can guarantee there are no other references to the LHS. Your initial a="d" had other references, since Python uses a cache of 1-character strings in the Latin-1 range, so the optimization didn't trigger. The optimization can also fail to trigger in a few other cases, such as if the LHS has a cached hash, or if realloc needs to move the string (in which case most of the optimization's code path executes, but it doesn't succeed in performing the operation in-place).
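A small experiment makes the reference condition visible. Treat it as a sketch rather than a specification: append_keeps_id is a hypothetical helper, and the only outcome that is guaranteed is the one where a second reference is held, because two live objects can never share an id. The other two results depend on the CPython version and on what realloc does, so they are printed without a promised value.

```python
def append_keeps_id(keep_alias=False, cache_hash=False):
    """Return True if `s += "!"` left id(s) unchanged."""
    # Built at run time from two pieces, so this is a fresh str object
    # with a single reference, not a shared compile-time constant.
    s = "".join(["sp", "am"])
    alias = s if keep_alias else None   # optional second reference to the LHS
    if cache_hash:
        hash(s)                         # stores the hash inside the str object
    before = id(s)                      # an int; it does not keep s alive
    s += "!"
    return id(s) == before


print(append_keeps_id())                 # often True on CPython, not guaranteed
print(append_keeps_id(cache_hash=True))  # expected False per the answer above
print(append_keeps_id(keep_alias=True))  # False: the alias keeps the old object alive
```

Note that the helper is careful not to store s anywhere else (for example in a list of results), since any extra reference would itself disable the optimization.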
This optimization violates the normal rules for id and +=. Normally, += on immutable objects is supposed to create a new object before clearing the reference to the old object, so the new and old objects should have overlapping lifetimes, forbidding equal id values. With the optimization in place, the string after the += has the same ID as the string before the +=.
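For contrast, the normal rule is easy to observe with tuples, which as far as I know get no comparable shortcut in CPython. The helper name tuple_append_keeps_id is made up for this sketch.

```python
def tuple_append_keeps_id():
    """Return True if `t += (3,)` left id(t) unchanged."""
    t = tuple([1, 2])   # built at run time, single reference
    before = id(t)
    # The new tuple is created while the old one is still referenced,
    # so the two objects are alive together and cannot share an id.
    t += (3,)
    return id(t) == before


print(tuple_append_keeps_id())   # False
```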
The language developers decided they cared more about people who would put string concatenation in a loop, see bad performance, and assume Python sucks, than they cared about this obscure technical point.