Is it safe to read past the end of a buffer within the same page on x86 and x64?
Yes, it's safe in x86 asm, and existing libc strlen(3)
implementations take advantage of this in hand-written asm. And even glibc's fallback C, but it compiles without LTO so it it can never inline. It's basically using C as a portable assembler to create machine code for one function, not as part of a larger C program with inlining. But that's mostly because it also has potential strict-aliasing UB, see my answer on the linked Q&A. You probably also want a GNU C __attribute__((may_alias))
typedef instead of plain unsigned long
as your wider type, like __m128i
etc. already use.
It's safe because an aligned load will never cross a higher alignment boundary, and memory protection happens with aligned pages, so at least 4k boundaries1Any naturally-aligned load that touches at least 1 valid byte can't fault. It's also safe to just check if you're far enough from the next page boundary to do a 16-byte load, like if (p & 4095 > (4096 - 16)) do_special_case_fallback
. See the section below about that for more detail.
It's also generally safe in C compiled for x86, as far as I know. Reading outside an object is of course Undefined Behaviour in C, but works in C-targeting-x86. I don't think compilers explicitly / on purpose define the behaviour, but in practice it works that way.
I think it's not the kind of UB that aggressive compilers will assume can't happen while optimizing, but confirmation from a compiler-writer on this point would be good, especially for cases where it's easily provable at compile-time that an access goes out of past the end of an object. (See discussion in comments with @RossRidge: a previous version of this answer asserted that it was absolutely safe, but that LLVM blog post doesn't really read that way).
This is required in asm to go faster than 1 byte at a time processing an implicit-length string. In C in theory a compiler could know how to optimize such a loop, but in practice they don't so you have to do hacks like this. Until that changes, I suspect that the compilers people care about will generally avoid breaking code that contains this potential UB.
There's no danger when the overread isn't visible to code that knows how long an object is. A compiler has to make asm that works for the case where there are array elements as far as we actually read. The plausible danger I can see with possible future compilers is: after inlining, a compiler might see the UB and decide that this path of execution must never be taken. Or that the terminating condition must be found before the final not-full-vector and leave that out when fully unrolling.
The data you get is unpredictable garbage, but there won't be any other potential side-effects. As long as the your program isn't affected by the garbage bytes, it's fine. (e.g. use bithacks to find if one of the bytes of a uint64_t
are zero, then a byte loop to find the first zero byte, regardless of what garbage is beyond it.)
Unusual situations where this wouldn't be safe in x86 asm
Hardware data breakpoints (watchpoints) that trigger on a load from a given address. If there's a variable you're monitoring right after an array, you could get a spurious hit. This might be a minor annoyance to someone debugging a normal program. If your function will be part of a program that uses x86 debug registers D0-D3 and the resulting exceptions for something that could affect correctness, then be careful with this.
Or similarly a code checker like valgrind could complain about reading outside an object.
Under a hypothetical 16 or 32-bit OS could that uses segmentation: A segment limit can use 4k or 1-byte granularity so it's possible to create a segment where the first faulting offset is odd. (Having the base of the segment aligned to a cache line or page is irrelevant except for performance). All mainstream x86 OSes use flat memory models, and x86-64 removes support for segment limits for 64-bit mode.
Memory-mapped I/O registers right after the buffer you wanted to loop over with wide loads, especially the same 64B cache-line. This is extremely unlikely even if you're calling functions like this from a device driver (or a user-space program like an X server that has mapped some MMIO space).
If you're processing a 60-byte buffer and need to avoid reading from a 4-byte MMIO register, you'll know about it and will be using a volatile T*
. This sort of situation doesn't happen for normal code.
strlen
is the canonical example of a loop that processes an implicit-length buffer and thus can't vectorize without reading past the end of a buffer. If you need to avoid reading past the terminating 0
byte, you can only read one byte at a time.
For example, glibc's implementation uses a prologue to handle data up to the first 64B alignment boundary. Then in the main loop (gitweb link to the asm source), it loads a whole 64B cache line using four SSE2 aligned loads. It merges them down to one vector with pminub
(min of unsigned bytes), so the final vector will have a zero element only if any of the four vectors had a zero. After finding that the end of the string was somewhere in that cache line, it re-checks each of the four vectors separately to see where. (Using the typical pcmpeqb
against a vector of all-zero, and pmovmskb
/ bsf
to find the position within the vector.) glibc used to have a couple different strlen strategies to choose from, but the current one is good on all x86-64 CPUs.
Usually loops like this avoid touching any extra cache-lines they don't need to touch, not just pages, for performance reasons, like glibc's strlen.
Loading 64B at a time is of course only safe from a 64B-aligned pointer, since naturally-aligned accesses can't cross cache-line or page-line boundaries.
If you do know the length of a buffer ahead of time, you can avoid reading past the end by handling the bytes beyond the last full aligned vector using an unaligned load that ends at the last byte of the buffer.
(Again, this only works with idempotent algorithms, like memcpy, which don't care if they do overlapping stores into the destination. Modify-in-place algorithms often can't do this, except with something like converting a string to upper-case with SSE2, where it's ok to reprocess data that's already been upcased. Other than the store-forwarding stall if you do an unaligned load that overlaps with your last aligned store.)
So if you are vectorizing over a buffer of known length, it's often best to avoid overread anyway.
Non-faulting overread of an object is the kind of UB that definitely can't hurt if the compiler can't see it at compile time. The resulting asm will work as if the extra bytes were part of some object.
But even if it is visible at compile-time, it generally doesn't hurt with current compilers.
PS: a previous version of this answer claimed that unaligned deref of int *
was also safe in C compiled for x86. That is not true. I was a bit too cavalier 3 years ago when writing that part. You need a typedef with GNU C __attribute__((aligned(1),may_alias))
, or memcpy
, to make that safe. The may_alias
part isn't needed if you only access it via signed/unsigned int*
and/or `char*, i.e. in ways that wouldn't violate the normal C strict-aliasing rules.
The set of things ISO C leaves undefined but that Intel intrinsics requires compilers to define does include creating unaligned pointers (at least with types like __m128i*
), but not dereferencing them directly. Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?
Checking if a pointer is far enough from the end of a 4k page
This is useful for the first vector of strlen; after this you can p = (p+16) & -16
to go to the next aligned vector. This will partially overlap if p
was not 16-byte aligned, but doing redundant work is sometimes the most compact way to set up for an efficient loop. Avoiding it might mean looping 1 byte at a time until an alignment boundary, and that's certainly worse.
e.g. check ((p + 15) ^ p) & 0xFFF...F000 == 0
(LEA / XOR / TEST) which tells you that the last byte of a 16-byte load has the same page-address bits as the first byte. Or p+15 <= p|0xFFF
(LEA / OR / CMP with better ILP) checks that the last byte-address of the load is <= the last byte of the page containing the first byte.
Or more simply, p & 4095 > (4096 - 16)
(MOV / AND / CMP), i.e. p & (pgsize-1) < (pgsize - vecwidth)
checks that the offset-within-page is far enough from the end of a page.
You can use 32-bit operand-size to save code size (REX prefixes) for this or any of the other checks because the high bits don't matter. Some compilers don't notice this optimization, so you can cast to unsigned int
instead of uintptr_t
, although to silence warnings about code that isn't 64-bit clean you might need to cast (unsigned)(uintptr_t)p
. Further code-size saving can be had with ((unsigned int)p << 20) > ((4096 - vectorlen) << 20)
(MOV / SHL / CMP), because shl reg, 20
is 3 bytes, vs. and eax, imm32
being 5, or 6 for any other register. (Using EAX will also allow the no-modrm short form for cmp eax, 0xfff
.)
If doing this in GNU C, you probably want typedef unsigned long aliasing_unaligned_ulong __attribute__((aligned(1),may_alias));
to make it safe to do unaligned accesses.
If you permit consideration of non-CPU devices, then one example of a potentially unsafe operation is accessing out-of-bounds regions of PCI-mapped memory pages. There's no guarantee that the target device is using the same page size or alignment as the main memory subsystem. Attempting to access, for example, address [cpu page base]+0x800
may trigger a device page fault if the device is in a 2KiB page mode. This will usually cause a system bugcheck.