Why is tailcall optimization not performed for types of class MEMORY?

Looks like a missed optimization bug that you should report, if there isn't already a duplicate open for gcc and clang.

(It's not rare for both gcc and clang to have the same missed optimization in cases like this; do not assume that something is illegal just because compilers don't do it. The only useful data is when compilers do perform an optimization: it's either a compiler bug or at least some compiler devs decided it was safe according to their interpretation of whatever standards.)

We can see GCC is returning its own incoming arg instead of returning the copy of it that create() will return in RAX. This is the missed optimization that's blocking tailcall optimization.

The ABI requires a function with a MEMORY-type return value to return the "hidden" pointer in RAX1.

GCC/clang do already realize they can elide actual copying by passing along their own return-value space, instead of allocating fresh space. But to do tailcall optimization, they'd have to realize that they can leave their callee's RAX value in RAX, instead of saving their incoming RDI in a call-preserved register.

If the ABI didn't require returning the hidden pointer in RAX, I expect gcc/clang would have had no problem with passing along the incoming RDI as part of an optimized tailcall.

Generally compilers like to shorten dependency chains; that's probably what's going on here. The compiler doesn't know that the latency from rdi arg to rax result of create() is probably just one mov instruction. Ironically, this could be a pessimization if the callee saves/restores some call-preserved registers (like r12), introducing a store/reload of the return-address pointer. (But that mostly only matters if anything even uses it. I did get some clang code to do so, see below.)

Footnote 1: Returning the pointer sounds like a good idea, but almost invariably the caller already knows where it put the arg in its own stack frame and will just use an addressing mode like 8(%rsp) instead of actually using RAX. At least in compiler-generated code, the RAX return value will typically go unused. (And if necessary, the caller can always save it somewhere themselves.)

As discussed in What prevents the usage of a function argument as hidden pointer? there are serious obstacles to using anything other than space in the caller's stack frame to receive a retval.

Having the pointer in a register just saves an LEA in the caller if the caller wants to store the address somewhere, if it is a static or stack address.

However, this case is close to one where it would be useful. If we're passing along our own retval space to a child function, we might want to modify that space after the call. Then it is useful for easy access to that space, e.g. to modify a return value before we return.

#define T struct Vec3

T use2(){
    T tmp = create();
    tmp.y = 0.0;
    return tmp;

Efficient handwritten asm:

        callq   create
        movq    $0, 8(%rax)

Actual clang asm at least still uses return-value optimization, vs. GCC9.1 copying. (Godbolt)

# clang -O3
use2:                                   # @use2
        pushq   %rbx
        movq    %rdi, %rbx
        callq   create
        movq    $0, 8(%rbx)
        movq    %rbx, %rax
        popq    %rbx

This ABI rule perhaps exists specifically for this case, or maybe the ABI designers were picturing that the retval space might be newly-allocated dynamic storage (which the caller would have to save a pointer to if the ABI didn't provide it in RAX). I didn't try that case.