Performance: 32-bit vs. 64-bit arithmetic
It depends on the exact CPU and operation. On 64-bit Pentium 4s, for example, multiplication on 64-bit registers was quite a bit slower than on 32-bit registers. Core 2 and later CPUs were designed for 64-bit operation from the ground up.
Generally, even code written for a 64-bit platform uses 32-bit variables where values will fit in them. This isn't primarily because arithmetic is faster (on modern CPUs, it generally isn't) but because it uses less memory and memory bandwidth.
A structure containing a dozen integers will be half the size if those integers are 32-bit rather than 64-bit. That means half as many bytes to store, half as much space in the cache, and so on.
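To make the size arithmetic concrete, here's a minimal sketch (the struct names are hypothetical):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical structs: the same dozen fields, different integer widths.
struct Narrow { int32_t v[12]; };  // 12 x 4 bytes
struct Wide   { int64_t v[12]; };  // 12 x 8 bytes

int main() {
    // Typically prints 48 and 96: the 64-bit version takes twice the
    // memory, cache space, and bandwidth to move around.
    std::printf("Narrow: %zu bytes, Wide: %zu bytes\n",
                sizeof(Narrow), sizeof(Wide));
}
```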
64-bit native registers and arithmetic are used where values may not fit into 32 bits. But the main performance benefits come from the extra general purpose registers available in the x86_64 instruction set. And of course, there are all the benefits that come from 64-bit pointers.
So the real answer is that it doesn't matter. Even if you use x86_64 mode, you can (and generally do) still use 32-bit arithmetic where it will do, and you get the benefits of larger pointers and more general purpose registers. When you use 64-bit native operations, it's because you need 64-bit operations and you know they'll be faster than faking them with multiple 32-bit operations, which is your only other choice. So the relative performance of 32-bit versus 64-bit registers should never be a deciding factor in any implementation decision.
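For a sense of what "faking it" costs, here's a sketch of a 64-bit addition built from 32-bit halves, the way a 32-bit-only target has to do it (the function name is illustrative; on a 64-bit target the same add is a single instruction):

```cpp
#include <cstdint>

// Sketch: adding two 64-bit values using only 32-bit arithmetic.
// The low halves are added first; a carry is detected via unsigned
// wraparound and propagated into the high-half addition.
uint64_t add64_via_32(uint32_t a_lo, uint32_t a_hi,
                      uint32_t b_lo, uint32_t b_hi) {
    uint32_t lo    = a_lo + b_lo;
    uint32_t carry = (lo < a_lo) ? 1u : 0u;  // wraparound means a carry out
    uint32_t hi    = a_hi + b_hi + carry;
    return (uint64_t(hi) << 32) | lo;
}
```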
I just stumbled upon this question, but I think one very important aspect is missing here: if you look at the assembly the compiler generates, using the type 'int' for indices will likely slow down the code. This is because 'int' defaults to a 32-bit type on many 64-bit compilers and platforms (Visual Studio, GCC), and doing address calculations that mix pointers (which are necessarily 64-bit on a 64-bit OS) with 'int' causes the compiler to emit unnecessary conversions between 32-bit and 64-bit registers. I experienced this in a very performance-critical inner loop of my code: switching the loop index from 'int' to 'long long' improved my algorithm's run time by about 10%, which was quite a huge gain considering the extensive SSE/AVX2 vectorization I was already using at that point.
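A hypothetical sketch of the pattern being described (the names are illustrative, and whether the widening conversion actually survives optimization depends on the compiler and flags):

```cpp
#include <cstddef>

// With an 'int' index on a 64-bit target, each p[i] may force the
// compiler to sign-extend i to 64 bits before the address calculation.
float sum_int_index(const float* p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += p[i];  // possible widening per use
    return s;
}

// A pointer-width index (size_t, ptrdiff_t, or 'long long' as in the
// answer above) lets the address arithmetic happen directly in a
// 64-bit register, with no widening conversion in the loop body.
float sum_wide_index(const float* p, long long n) {
    float s = 0.0f;
    for (long long i = 0; i < n; ++i) s += p[i];
    return s;
}
```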