x86-64 canonical address?
Section 3.3.7.1 of the Intel manual covers this in five (difficult to digest) paragraphs. For me it's page 74 of the four-volume set, which you can download from Intel's site or reach directly here: https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf
What these paragraphs say is that a canonical address does not use all 64 bits independently: only the low bits are implemented, and every bit above them must be a copy of the most-significant implemented bit. Implementations differ in how many bits they implement, such as 48 or 57. (57-bit requires an extra level of page tables, increasing the cost of page walks. See https://en.wikipedia.org/wiki/Intel_5-level_paging for more about this CPU feature, which can be left disabled.)
A 48-bit implementation would have a high-half canonical range starting at
0xFFFF800000000000
while the lower half would end at
0x00007FFFFFFFFFFF
Bits 63 down to the most-significant implemented bit mark an address as canonical when they are all ones or all zeros. In a 57-bit implementation I'd immediately know I'm looking at a canonical address when I see 0xFF____ or 0x00____. (The low bit of the top byte is a significant address bit, and the other 7 are copies of it, i.e. correctly sign-extended.)
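As a rough illustration of that rule, here is a minimal C sketch of the check. The helper name is_canonical and its bits parameter are made up for this example (they are not from the Intel manual); bits is the number of implemented virtual address bits, 48 or 57 in the cases discussed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if addr is canonical on a CPU that
 * implements `bits` virtual address bits (48 with 4-level paging,
 * 57 with 5-level paging). Assumes 1 <= bits <= 64. */
static bool is_canonical(uint64_t addr, unsigned bits)
{
    /* Keep bits 63 down to bits-1: the top implemented bit and every
     * bit above it, which must all be copies of it. */
    uint64_t upper    = addr >> (bits - 1);
    uint64_t all_ones = UINT64_MAX >> (bits - 1);

    return upper == 0 || upper == all_ones;
}
```

With bits set to 48, both 0xFFFF800000000000 and 0x00007FFFFFFFFFFF pass, while 0x0000800000000000 (one past the end of the lower half) does not.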
Maybe a helpful way to remember this is the word canonical itself means relating to a general rule, or way of doing something. In general, no one needs as many addresses as 64 bits can provide, so they are generally not used. Also if something is according to canon like in Star Trek or comic books, it's the way things were seen or done originally.
Now to answer WHY we have canonical addresses: no one will need to address up to 16 exabytes (the theoretical limit of a 64-bit machine), so the second paragraph of that section just says the Intel architecture "defines" a 64-bit linear address, but it looks like no one will use all of it. Just in case, the third paragraph says the implementation will still check those upper bits and, if the address is NOT in canonical form, generate a "general-protection" exception.
The main reason for checking for canonical addresses instead of silently ignoring the upper bits is to make sure software is forward compatible with future hardware that supports more virtual address bits.
I suggest that you download the full software developer's manual. The documentation is available in separate volumes, but that link gives you all seven volumes in a single massive PDF, which makes it easier to search for things.
The answer is in section 3.3.7.1. The first line of that section states
In 64-bit mode, an address is considered to be in canonical form if address bits 63 through to the most-significant implemented bit by the microarchitecture are set to either all ones or all zeros.
It goes on from there...
You can use cpuid to query the supported virtual address width on that CPU (i.e. "implemented by the microarchitecture"). Or you can normally just assume 48-bit.
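A minimal sketch of that query in C, assuming GCC or Clang (the <cpuid.h> header and its __get_cpuid helper come from those toolchains, not from the manual). The function name virtual_address_bits is made up for this example. CPUID leaf 0x80000008 reports the linear address width in EAX bits 15:8.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header (toolchain assumption) */
#endif

/* Hypothetical helper: returns the number of linear (virtual) address
 * bits the CPU reports, or -1 if the leaf is unavailable or this is
 * not an x86 build. */
static int virtual_address_bits(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 0x80000008: EAX[7:0]  = physical address bits,
     *                  EAX[15:8] = linear address bits. */
    if (__get_cpuid(0x80000008u, &eax, &ebx, &ecx, &edx))
        return (int)((eax >> 8) & 0xFF);
#endif
    return -1;
}
```

Note that this is the width the CPU supports. On a 57-bit-capable CPU where the OS has left 5-level paging disabled, addresses are still checked against the 48-bit rule.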
I.e. a canonical virtual address is 48 bits correctly sign-extended to 64. If the high bits don't match, it's non-canonical and will fault if you attempt to dereference it.
(Or with Intel's upcoming 5-level page table extension, 57 bits sign-extended to 64).
This answer is less detailed than the previous ones, but IMHO easier to understand:
While 64-bit processors have 64-bit wide registers, systems generally do not implement all 64-bits for addressing (16 exabytes of theoretical physical memory).
Thus most architectures define an unimplemented region of the address space which the processor will consider invalid for use. x86-64 (...) define the most-significant valid bit of an address, which must then be sign-extended (...) to create a valid address. The result of this is that the total address space is effectively divided into two parts, an upper and a lower portion, with the addresses in-between considered invalid. (...) Valid addresses are termed canonical addresses (invalid addresses being non-canonical).
From https://www.bottomupcs.com/virtual_memory_is.xhtml
Sign-extended means that the most-significant implemented bit is copied into all the upper bits of the address. Upper-half addresses therefore start with 11111..., and lower-half addresses start with 00000...
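To make the sign extension concrete, here is a small C sketch that builds the canonical form of an address from its low bits. The name sign_extend and the bits parameter are hypothetical, chosen only for this example.

```c
#include <stdint.h>

/* Hypothetical helper: keeps the low `bits` bits of addr and copies
 * bit (bits-1) into every higher bit. Assumes 1 <= bits <= 63. */
static uint64_t sign_extend(uint64_t addr, unsigned bits)
{
    uint64_t low_mask = (UINT64_C(1) << bits) - 1;  /* implemented bits */
    uint64_t top_bit  = UINT64_C(1) << (bits - 1);  /* most-significant implemented bit */

    addr &= low_mask;                               /* discard whatever was in the upper bits */
    return (addr & top_bit) ? (addr | ~low_mask)    /* upper half: fill with ones */
                            : addr;                 /* lower half: upper bits stay zero */
}
```

For a 48-bit implementation, sign_extend(0x0000800000000000, 48) gives 0xFFFF800000000000, while 0x00007FFFFFFFFFFF comes back unchanged.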