Performance of unaligned SIMD load/store on aarch64

If a load/store has to be split or crosses a cache line, at least one extra cycle is required.

There are exhaustive tables specifying the number of cycles required for various alignments and register counts for the Cortex-A8 (in-order) and Cortex-A9 (partially OoO). For example, vld1 with one register incurs a 1-cycle penalty for unaligned access vs. 64-bit-aligned access.

The Cortex-A55 (in-order) performs loads of up to 64 bits and stores of up to 128 bits, and accordingly, section 3.3 of its optimization manual states that a 1-cycle penalty is incurred for:

• Load operations that cross a 64-bit boundary
• 128-bit store operations that cross a 128-bit boundary

The Cortex-A75 (OoO) has penalties per section 5.4 of its optimization guide for:

• Load operations that cross a 64-bit boundary.
• In AArch64, all stores that cross a 128-bit boundary.
• In AArch32, all stores that cross a 64-bit boundary.

And as in Guillermo's answer, the A57 (OoO) has penalties for:

• Load operations that cross a cache-line (64-byte) boundary
• Store operations that cross a [128-bit] boundary

I'm somewhat skeptical that the A57 has no penalty for crossing 64-bit boundaries, given that the A55 and A75 do. All of these chips have 64-byte cache lines, so they should all incur penalties for crossing cache lines too. Finally, note that behavior is unpredictable for a split access that crosses a page boundary.

From some crude testing (without perf counters) on a Cavium ThunderX, the penalty appears closer to 2 cycles, but that may be an additive effect of having back-to-back unaligned loads and stores in a loop.

AArch64 NEON instructions don't distinguish between aligned and unaligned accesses (see LD1, for example). For AArch32 NEON, alignment is specified statically in the addressing mode (VLDn):

vld1.32 {d16-d17}, [r0]    ; no alignment
vld1.32 {d16-d17}, [r0@64] ; 64-bit aligned
vld1.32 {d16-d17}, [r0:64] ; 64-bit aligned, used by GAS to avoid comment ambiguity

I don't know if aligned access without alignment qualifiers performs any slower than access with alignment qualifiers on recent chips running in AArch32 mode. Somewhat old documentation from ARM encourages using qualifiers whenever possible. (Intel refined their chips such that unaligned and aligned moves perform the same when the address is aligned, by comparison.)

If you're using intrinsics, MSVC has _ex-suffixed variants that accept the alignment. A reliable way to get GCC to emit an alignment qualifier is with __builtin_assume_aligned.

vld1q_u16_ex(addr, 64);    // MSVC: alignment given in bits
// GCC:
addr = (uint16_t*)__builtin_assume_aligned(addr, 8);    // alignment in bytes

Section 4.6 (Load/Store Alignment) of the Cortex-A57 Software Optimization Guide says:

The ARMv8-A architecture allows many types of load and store accesses to be arbitrarily aligned. The Cortex-A57 processor handles most unaligned accesses without performance penalties. However, there are cases which reduce bandwidth or incur additional latency, as described below:

  • Load operations that cross a cache-line (64-byte) boundary
  • Store operations that cross a 16-byte boundary

So it may depend on the processor that you are using, out-of-order (Cortex-A57, A72, A75) or in-order (Cortex-A35, A53, A55). I didn't find an optimization guide for the in-order processors; however, they do have a hardware performance counter that you can use to check whether unaligned accesses affect performance:

    0x0F  UNALIGNED_LDST_RETIRED  Unaligned load-store

This can be used with the perf tool.
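For example, an invocation using perf's raw-event syntax might look like this (`./my_benchmark` is a placeholder for your own program; event availability and numbering should be checked against your core's TRM, as not every implementation exposes this event):

```shell
# Count retired unaligned load/stores via the raw PMU event number
# (0x0f = UNALIGNED_LDST_RETIRED on cores that implement it).
perf stat -e r0f ./my_benchmark

# Some kernels also expose it through the PMU device interface:
perf stat -e armv8_pmuv3/event=0x0f/ ./my_benchmark
```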

There are no special instructions for unaligned accesses in AArch64.