Is there some meaningful statistical data to justify keeping signed integer arithmetic overflow undefined?
I don't know about studies and statistics, but yes, there are definitely optimizations taking this into account that compilers actually do. And yes, they are very important (tldr loop vectorization for example).
Besides the compiler optimizations, there is another aspect to be taken into account. With UB you get C/C++ signed integers to behave arithmetically as you would expect mathematically. For instance x + 10 > x
holds true now (for valid code of course), but would not on a wrap-around behavior.
I've found an excellent article How undefined signed overflow enables optimizations in GCC from Krister Walfridsson’s blog listing some optimizations that take signed overflow UB into account. The following examples are from it. I am adding c++ and assembly examples to them.
If the optimizations look too simple, uninteresting or unimpactful, remember that these optimization are just steps in a much much larger chain of optimizations. And the butterfly effect does happen as a seemingly unimportant optimization at an earlier step can trigger a much more impactful optimization at a later step.
If the examples look nonsensical (who would write x * 10 > 0
) keep in mind that you can very easily get to this kind of examples in C and C++ with constants, macros, templates. Besides the compiler can get to this kind of examples when applying transformations and optimizations in its IR.
Signed integer expression simplification
Eliminate multiplication in comparison with 0
(x * c) cmp 0 -> x cmp 0
bool foo(int x) { return x * 10 > 0 }
foo(int): test edi, edi setg al ret
Eliminate division after multiplication
(x * c1) / c2 -> x * (c1 / c2) if c1 is divisible by c2
int foo(int x) { return (x * 20) / 10; }
foo(int): lea eax, [rdi+rdi] ret
Eliminate negation
(-x) / (-y) -> x / y
int foo(int x, int y) { return (-x) / (-y); }
foo(int, int): mov eax, edi cdq idiv esi ret
Simplify comparisons that are always true or false
x + c < x -> false x + c <= x -> false x + c > x -> true x + c >= x -> true
bool foo(int x) { return x + 10 >= x; }
foo(int): mov eax, 1 ret
Eliminate negation in comparisons
(-x) cmp (-y) -> y cmp x
bool foo(int x, int y) { return -x < -y; }
foo(int, int): cmp edi, esi setg al ret
Reduce magnitude of constants
x + c > y -> x + (c - 1) >= y x + c <= y -> x + (c - 1) < y
bool foo(int x, int y) { return x + 10 <= y; }
foo(int, int): add edi, 9 cmp edi, esi setl al ret
Eliminate constants in comparisons
(x + c1) cmp c2 -> x cmp (c2 - c1) (x + c1) cmp (y + c2) -> x cmp (y + (c2 - c1)) if c1 <= c2
The second transformation is only valid if c1 <= c2, as it would otherwise introduce an overflow when y has the value INT_MIN.
bool foo(int x) { return x + 42 <= 11; }
foo(int): cmp edi, -30 setl al ret
Pointer arithmetic and type promotion
If an operation does not overflow, then we will get the same result if we do the operation in a wider type. This is often useful when doing things like array indexing on 64-bit architectures — the index calculations are typically done using 32-bit int, but the pointers are 64-bit, and the compiler may generate more efficient code when signed overflow is undefined by promoting the 32-bit integers to 64-bit operations instead of generating type extensions.
One other aspect of this is that undefined overflow ensures that a[i] and a[i+1] are adjacent. This improves analysis of memory accesses for vectorization etc.
This is a very important optimization as loop vectorization one of the most efficient and effective optimization algorithms.
This is an example when changing an index from an unsigned index to a signed improves the generated assembly:
Unsigned version
#include <cstddef>
auto foo(int* v, std::size_t start)
{
int sum = 0;
for (std::size_t i = start; i < start + 4; ++i)
sum += v[i];
return sum;
}
With unsigned the case where start + 4
wraps around must be taken into account and a branch is generated to deal with this case (branches are bad for performance):
; gcc on x64 with -march=skylake
foo1(int*, unsigned long):
cmp rsi, -5
ja .L3
vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
vpsrldq xmm1, xmm0, 8
vpaddd xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 4
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
.L3:
xor eax, eax
ret
; clang on x64 with -march=skylake
foo1(int*, unsigned long): # @foo1(int*, unsigned long)
xor eax, eax
cmp rsi, -4
jae .LBB0_2
vpbroadcastq xmm0, qword ptr [rdi + 4*rsi + 8]
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
vpshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
.LBB0_2:
ret
As a side note, using a narrower type would result in even worst assembly, inhibiting the use of SSE vectorized instructions:
#include <cstddef>
auto foo(int* v, unsigned start)
{
int sum = 0;
for (unsigned i = start; i < start + 4; ++i)
sum += v[i];
return sum;
}
; gcc on x64 with -march=skylake
foo(int*, unsigned int):
cmp esi, -5
ja .L3
mov eax, esi
mov eax, DWORD PTR [rdi+rax*4]
lea edx, [rsi+1]
add eax, DWORD PTR [rdi+rdx*4]
lea edx, [rsi+2]
add eax, DWORD PTR [rdi+rdx*4]
lea edx, [rsi+3]
add eax, DWORD PTR [rdi+rdx*4]
ret
.L3:
xor eax, eax
ret
; clang on x64 with -march=skylake
foo(int*, unsigned int): # @foo(int*, unsigned int)
xor eax, eax
cmp esi, -5
ja .LBB0_3
mov ecx, esi
add esi, 4
mov eax, dword ptr [rdi + 4*rcx]
lea rdx, [rcx + 1]
cmp rdx, rsi
jae .LBB0_3
add eax, dword ptr [rdi + 4*rcx + 4]
add eax, dword ptr [rdi + 4*rcx + 8]
add eax, dword ptr [rdi + 4*rcx + 12]
.LBB0_3:
ret
Signed version
Using a signed index however results in nice vectorized branchless code:
#include <cstddef>
auto foo(int* v, std::ptrdiff_t start)
{
int sum = 0;
for (std::ptrdiff_t i = start; i < start + 4; ++i)
sum += v[i];
return sum;
}
; gcc on x64 with -march=skylake
foo(int*, long):
vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
vpsrldq xmm1, xmm0, 8
vpaddd xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 4
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
; clang on x64 with -march=skylake
foo(int*, long): # @foo(int*, long)
vpbroadcastq xmm0, qword ptr [rdi + 4*rsi + 8]
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rsi]
vpshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
Vectorized instruction are still used when using a narrower signed type:
#include <cstddef>
auto foo(int* v, int start)
{
int sum = 0;
for (int i = start; i < start + 4; ++i)
sum += v[i];
return sum;
}
; gcc on x64 with -march=skylake
foo(int*, int):
movsx rsi, esi
vmovdqu xmm0, XMMWORD PTR [rdi+rsi*4]
vpsrldq xmm1, xmm0, 8
vpaddd xmm0, xmm0, xmm1
vpsrldq xmm1, xmm0, 4
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
; clang on x64 with -march=skylake
foo(int*, int): # @foo(int*, int)
movsxd rax, esi
vpbroadcastq xmm0, qword ptr [rdi + 4*rax + 8]
vpaddd xmm0, xmm0, xmmword ptr [rdi + 4*rax]
vpshufd xmm1, xmm0, 85 # xmm1 = xmm0[1,1,1,1]
vpaddd xmm0, xmm0, xmm1
vmovd eax, xmm0
ret
Value range calculations
The compiler keeps track of the variables' range of possible values at each point in the program, i.e. for code such as
int x = foo(); if (x > 0) { int y = x + 5; int z = y / 4;
it determines that x has the range
[1, INT_MAX]
after the if-statement, and can thus determine that y has the range[6, INT_MAX]
as overflow is not allowed. And the next line can be optimized toint z = y >> 2;
as the compiler knows that y is non-negative.
auto foo(int x)
{
if (x <= 0)
__builtin_unreachable();
return (x + 5) / 4;
}
foo(int):
lea eax, [rdi+5]
sar eax, 2
ret
The undefined overflow helps optimizations that need to compare two values (as the wrapping case would give possible values of the form
[INT_MIN, (INT_MIN+4)]
or[6, INT_MAX]
that prevents all useful comparisons with<
or>
), such as
- Changing comparisons
x<y
to true or false if the ranges forx
andy
does not overlap- Changing
min(x,y)
ormax(x,y)
tox
ory
if the ranges do not overlap- Changing
abs(x)
tox
or-x
if the range does not cross0
- Changing
x/c
tox>>log2(c)
ifx>0
and the constantc
is a power of2
- Changing
x%c
tox&(c-1)
ifx>0
and the constantc
is a power of2
Loop analysis and optimization
The canonical example of why undefined signed overflow helps loop optimizations is that loops like
for (int i = 0; i <= m; i++)
are guaranteed to terminate for undefined overflow. This helps architectures that have specific loop instructions, as they do in general not handle infinite loops.
But undefined signed overflow helps many more loop optimizations. All analysis such as determining number of iteration, transforming induction variables, and keeping track of memory accesses are using everything in the previous sections in order to do its work. In particular, the set of loops that can be vectorized are severely reduced when signed overflow is allowed.
Not quite an example of optimization, but one useful consequence of undefined behaviour is -ftrapv
command line switch of GCC/clang. It inserts code which crashes your program on integer overflow.
It won't work on unsigned integers, in accordance with the idea that unsigned overflow is intentional.
The Standard's wording on signed integer overflow ensures that people won't write overflowing code on purpose, so ftrapv
is a useful tool to discover unintentional overflow.
Here's an actual little benchmark, bubble sort. I've compared timings without/with -fwrapv
(which means the overflow is UB/not UB). Here are the results (seconds):
-O3 -O3 -fwrapv -O1 -O1 -fwrapv
Machine1, clang 5.2 6.3 6.8 7.7
Machine2, clang-8 4.2 7.8 6.4 6.7
Machine2, gcc-8 6.6 7.4 6.5 6.5
As you can see, the not-UB (-fwrapv
) version is almost always slower, the largest difference is pretty big, 1.85x.
Here's the code. Note, that I intentionally chose an implementation, which should produce a larger difference for this test.
#include <stdio.h>
#include <stdlib.h>
void bubbleSort(int *a, long n) {
bool swapped;
for (int i = 0; i < n-1; i++) {
swapped = false;
for (int j = 0; j < n-i-1; j++) {
if (a[j] > a[j+1]) {
int t = a[j];
a[j] = a[j+1];
a[j+1] = t;
swapped = true;
}
}
if (!swapped) break;
}
}
int main() {
int a[8192];
for (int j=0; j<100; j++) {
for (int i=0; i<8192; i++) {
a[i] = rand();
}
bubbleSort(a, 8192);
}
}