Why is fastcall slower than stdcall?
Several reasons:
- At least in most decent x86 implementations, register renaming is in effect -- the effort that looks like it's being saved by using a register instead of memory might not be doing anything on the hardware level.
- Sure, you save some stack movement effort with __fastcall, but you reduce the number of registers available for use in the function without modifying the stack.
Most of the time where __fastcall would be faster, the function is simple enough to be inlined in any case, which means that it really doesn't matter in real software. (This is one of the main reasons why __fastcall is not often used.)
Side note: What was wrong with Anon's answer?
__fastcall was introduced a long time ago. At the time, Watcom C++ was beating Microsoft for optimization, and a number of reviewers picked out its register-based calling convention as one (possible) reason why.
Microsoft responded by adding __fastcall, and they've retained it ever since -- but I don't think they ever did much more than enough to be able to say "we have a register-based calling convention too..." Their preference (especially since the 32-bit migration) seems to be for __stdcall. They've put quite a bit of work into improving their code generation with it, but (apparently) not nearly so much with __fastcall. With on-chip caching, the gain from passing things in registers isn't nearly as great as it was then anyway.
Fastcall is really only meaningful if you use full optimization (otherwise its effects will be buried by other artifacts), but as you note, with full optimization, the functions will be inlined and you won't see the effect of calling conventions at all.
So to actually test this, you need to make the functions extern declarations with the actual definitions in a separate source file that you compile separately and link with your main routine. When you do that, you'll see that __fastcall is consistently ~25% faster with small functions like this.
The upshot is that __fastcall is really only useful if you have a lot of calls to tiny functions that can't be inlined because they need to be separately compiled.
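A rough single-file sketch of that kind of test follows. It is not the original benchmark: the FASTCALL, STDCALL and NOINLINE macros and the run_fast/run_std helpers are names made up for illustration, and a noinline attribute stands in for true separate compilation (which is still the more reliable way to keep the optimizer out of the call). The bodies of func and func2 are taken from the disassembly, which shows each returning its argument plus 5. On targets other than 32-bit x86 the calling-convention macros expand to nothing, so both functions use the platform default there.

```c
#include <assert.h>

/* Calling-convention keywords only mean something on 32-bit x86;
   on every other target these hypothetical macros expand to nothing. */
#if defined(__GNUC__) && defined(__i386__)
#  define FASTCALL __attribute__((fastcall))
#  define STDCALL  __attribute__((stdcall))
#elif defined(_MSC_VER) && defined(_M_IX86)
#  define FASTCALL __fastcall
#  define STDCALL  __stdcall
#else
#  define FASTCALL
#  define STDCALL
#endif

/* Stand-in for separate compilation: ask the compiler not to inline. */
#if defined(__GNUC__)
#  define NOINLINE __attribute__((noinline))
#elif defined(_MSC_VER)
#  define NOINLINE __declspec(noinline)
#else
#  define NOINLINE
#endif

/* Argument arrives in ECX under fastcall on 32-bit x86. */
NOINLINE int FASTCALL func(int i)  { return i + 5; }

/* Argument arrives on the stack under stdcall on 32-bit x86. */
NOINLINE int STDCALL  func2(int i) { return i + 5; }

/* Call func iter times and accumulate the results so the calls are used. */
long run_fast(int iter)
{
    long sum = 0;
    assert(iter >= 0);
    for (int i = 0; i < iter; i++)
        sum += func(i);
    return sum;
}

/* Same loop, but through the stdcall version. */
long run_std(int iter)
{
    long sum = 0;
    assert(iter >= 0);
    for (int i = 0; i < iter; i++)
        sum += func2(i);
    return sum;
}
```

To see a difference you would build this as 32-bit code (for example with gcc's -m32) and time run_fast against run_std; in a 64-bit build both loops call identical code.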
Edit
So with separate compilation and gcc -O3 -fomit-frame-pointer -m32 I see quite different code for the two functions:
func:
    leal 5(%ecx), %eax
    ret
func2:
    movl 4(%esp), %eax
    addl $5, %eax
    ret
Running that with iter=5000 consistently gives me results close to
9990000
14160000
indicating that the fastcall version is a shade over 40% faster.
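For anyone checking the arithmetic, "faster" can be read two ways, and the two timings above give different percentages depending on which one you pick. The helpers below (percent_longer and percent_saved, names invented here) are just a small sketch of both readings. With the numbers above, the stdcall version takes about 41.7% longer than the fastcall one, which is the same thing as the fastcall version taking about 29.4% less time.

```c
#include <assert.h>

/* How much longer the slow run took, as a percentage of the fast run. */
double percent_longer(double fast, double slow)
{
    assert(fast > 0.0);
    return (slow / fast - 1.0) * 100.0;
}

/* How much time the fast run saved, as a percentage of the slow run. */
double percent_saved(double fast, double slow)
{
    assert(slow > 0.0);
    return (1.0 - fast / slow) * 100.0;
}
```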
Your micro-benchmark produces irrelevant results. __fastcall has specific uses with SSE instructions (see XNAMath); clock() is not even remotely a suitable timer for benchmarking; __fastcall exists for multiple platforms such as Itanium, not just x86; and your whole program can be optimized down to nothing except the printf statements, which makes the relative performance of __fastcall and __stdcall irrelevant.
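As a rough illustration of the timer and dead-code points, here is a minimal sketch that assumes a POSIX system where clock_gettime with CLOCK_MONOTONIC is available (on Windows you would need a different API). The names add5, now_seconds, time_calls and sink are made up for this example. Writing each result to a volatile variable keeps the compiler from deleting the loop; it does not stop add5 from being inlined, so this only addresses the measurement problems, not the calling-convention comparison itself.

```c
#define _POSIX_C_SOURCE 199309L  /* request clock_gettime from <time.h> */
#include <assert.h>
#include <time.h>

/* Every store to a volatile object must really happen, so the loop
   that feeds it cannot be optimized away. */
static volatile int sink;

/* Hypothetical function under test. */
static int add5(int i) { return i + 5; }

/* Current reading of the monotonic clock, in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Time iter calls to add5 and return the elapsed wall-clock seconds. */
double time_calls(int iter)
{
    assert(iter > 0);
    double start = now_seconds();
    for (int i = 0; i < iter; i++)
        sink = add5(i);
    return now_seconds() - start;
}
```

In practice you would also repeat the measurement several times and look at the spread, since a single run of a loop this small is mostly noise.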
Finally, you've overlooked the main reason that a lot of things are done the way they are: legacy. __fastcall may well have been significant before compiler inlining became as aggressive and effective as it is today, and no compiler will remove __fastcall, as there will be programs that depend on it. That makes __fastcall a fact of life.