Which cache coherence protocol do Intel and AMD use?
Intel uses the MESIF protocol (http://www.realworldtech.com/common-system-interface/5/, https://en.wikipedia.org/wiki/MESIF_protocol) in QuickPath, and AMD uses the MOESI protocol (https://en.wikipedia.org/wiki/MOESI_protocol, http://www.m5sim.org/MOESI_hammer), with or without a Probe Filter, in HyperTransport. But these protocols are for inter-chip communication (an AMD Bulldozer socket has 2 chips in an MCM). As far as I know, in both processors intra-chip coherence is handled at the L3 cache.
A tool you could use to check for NUMA performance issues is numagrind: http://dx.doi.org/10.1109/IPDPS.2011.100
This answer applies to the Intel CPUs that have an inclusive L3 cache and a Sandy Bridge style ring bus (i.e. not the Nehalem/Westmere EX one), which is all server CPUs after Sandy Bridge up to, but not including, Skylake server.
It is widely said that Intel uses MESIF, but AFAICT the F state doesn't exist in the core. The core (*) lines will be in MESI states, because with an inclusive L3 cache the data is read directly out of L3 if it is present in more than one core, so a dedicated F state is not required. The F state does, however, exist in the cores on Skylake server, which has a non-inclusive L3.
The cores send IDI packets to the L3 cache slice CBo (controller) that handles that address range (addresses are interleaved across slices by a hash function of the upper part of the cache set selector bits of the address, modulo the number of CBos). A DRd packet is sent by the core when it requests a line that it doesn't own, and it receives the line in S state if it is present in other cores or in E state if it isn't. The L3 cache slice CBo uses the snoop filter for the line to decide whether to return it in E state (the line is in no other core, whether or not it is in L3) or in S state (the line is in L3 and present in another core, in which case the CBo sends that core an E->S downgrade). Defaulting the first request for a line to E state rather than S state when no other core owns it is an optimisation: the core doesn't have to perform an RFO before writing, at the slight cost of the L3 cache slice having to send downgrades to cores (which is just extra background traffic compared to the actual delay that performing an RFO would cause).
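As a rough illustration of the E-versus-S decision described above, here is a minimal Python sketch. It is not how the hardware is implemented: the function name `handle_drd`, the dictionary layout of the snoop filter, and the per-core state tables are all hypothetical, and the sketch ignores the L3 contents, the slice hash, and the case where the other core's copy is modified.

```python
def handle_drd(snoop_filter, core_states, requester, addr):
    """Toy DRd handler: return the state ("E" or "S") granted to the requester.

    snoop_filter: dict addr -> set of core ids holding the line (assumed layout)
    core_states:  dict core id -> dict addr -> MESI state (assumed layout)
    """
    holders = snoop_filter.setdefault(addr, set())
    others = holders - {requester}
    if others:
        # Line is present in another core: downgrade those copies to S
        # and hand the requester a shared copy.
        for core in others:
            core_states.setdefault(core, {})[addr] = "S"
        granted = "S"
    else:
        # Line is in no other core: grant E so that a later write
        # does not need an RFO.
        granted = "E"
    holders.add(requester)
    core_states.setdefault(requester, {})[addr] = granted
    return granted


snoop_filter, core_states = {}, {}
first = handle_drd(snoop_filter, core_states, 0, 0x1000)   # no other holder
second = handle_drd(snoop_filter, core_states, 1, 0x1000)  # core 0 holds it
print(first, second, core_states[0][0x1000])
```

In this toy run the first reader gets the line in E, the second reader gets it in S, and the first reader's copy is downgraded to S.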
An RFO packet is sent to the LLC slice CBo when the core is about to write to a line that it doesn't own at all. In this case the CBo needs to send invalidates if the line is owned by more than one core, or a snoop invalidate if it is owned by only one core (because the CBo does not know whether that copy is modified or not), as well as snooping the home agent that owns the address cross-socket; it then returns the line to the core and upgrades it. When the core already owns the line in S state, it instead sends a write invalidate (WiL) to the L3 slice CBo, which invalidates the other cores' copies and upgrades the requester from S to E. Presumably there is a flag in the packet to indicate that the line is in S state, to eliminate the unnecessary load.
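The difference between the two write paths can be sketched as follows. This is a toy model under stated assumptions, not Intel's implementation: the function names `rfo` and `wil` simply mirror the packet names, the `core_states` table is a hypothetical stand-in for the private caches, and cross-socket snooping and writeback of modified data are left out.

```python
def _invalidate_others(core_states, requester, addr):
    """Invalidate every other core's copy; return the cores that were snooped."""
    snooped = []
    for core, lines in core_states.items():
        if core != requester and lines.get(addr, "I") != "I":
            lines[addr] = "I"
            snooped.append(core)
    return snooped


def rfo(core_states, requester, addr):
    """Requester holds no copy: invalidate others, then return the line with ownership."""
    if core_states[requester].get(addr, "I") != "I":
        raise ValueError("RFO is for a line the requester does not hold")
    snooped = _invalidate_others(core_states, requester, addr)
    core_states[requester][addr] = "E"  # the store that follows would make it M
    return snooped


def wil(core_states, requester, addr):
    """Requester holds S: invalidate others and upgrade S -> E, no data transfer needed."""
    if core_states[requester].get(addr) != "S":
        raise ValueError("WiL is for a line the requester holds in S")
    snooped = _invalidate_others(core_states, requester, addr)
    core_states[requester][addr] = "E"
    return snooped


states = {0: {0x40: "S"}, 1: {0x40: "S"}, 2: {}}
print(rfo(states, 2, 0x40), states)
```

Both paths end with the requester as the only holder; the only modelled difference is whether the requester already had the data.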
The F state is only for the L3 cache (caching agent) in the context of multi-socket snooping between other caching agents and the home agent in the home node, as the home agent HitME cache is non-inclusive of any socket's L3. In source snoop mode without a directory, only the one caching agent (the collective set of CBos in a NUMA node) that holds the line in F state responds to a broadcast snoop, rather than several agents responding. In home snoop mode with a directory cache + directory, the directory cache and directory bits mean that, where possible, only one request is sent anyway; but when it's not cached and a broadcast is sent, the F state helps, as there are not multiple responses. Because a cache may unilaterally discard (invalidate) a line in the S or F states, it is possible that no cache has a copy in the F state, even though copies in the S state exist. In this case, a request for the line is satisfied (less efficiently, but still correctly) from main memory (because no caching agent will respond when they're in the S state).
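The "at most one responder" behaviour for shared lines can be illustrated with a small sketch. It assumes a hypothetical mapping from caching agent name to line state and only models the S/F/I cases discussed above; the function name `data_source` and the `"memory"` sentinel are made up for the example.

```python
def data_source(agent_states):
    """Return which caching agent answers a broadcast snoop for a shared line.

    agent_states: dict agent name -> state of the line there ("S", "F" or "I").
    Agents holding the line in S stay silent; only the F holder forwards data.
    If no agent holds F (e.g. it silently dropped the line), memory supplies it.
    """
    forwarders = [agent for agent, state in agent_states.items() if state == "F"]
    if len(forwarders) > 1:
        raise ValueError("at most one agent may hold a line in F")
    return forwarders[0] if forwarders else "memory"


print(data_source({"socket0": "S", "socket1": "F", "socket2": "S"}))
print(data_source({"socket0": "S", "socket1": "S"}))
```

The second call shows the fallback case from the paragraph above: S copies exist, but nobody holds F, so the request is served from main memory.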
The 'home node home agent' is the home agent that the SAD decoded coherent DRAM address interleaves to (i.e. the home agent that owns that address).
IDI opcodes (which are used for core<->uncore communication) in a 2014 performance monitoring manual for Xeon E5 v2s do not show any F states (only the QPI opcodes do; these talk about caching agents and home agents and are for uncore<->uncore communication), but a 2017 performance monitoring manual shows IDI opcodes dealing with F states as well, e.g. WbEFtoE and WbEFtoI, and talks about 'cores'. Searching the document for Skylake shows a result for Skylake server, which has a non-inclusive L3, which says it all.
Because L2 is non-inclusive on recent Intel desktop CPUs, it could mean that L1i and L1d implement their own F states, which the L2 could use internally between the 2 caches it serves (L1i and L1d, which are shared by both hyperthreads in the core) on cache misses. This is not necessary, though, if the L1d and L1i caches are able to query/invalidate each other internally, which seems faster than going to L2 and having L2 query the cache the request didn't originate from, given that there is only one other cache to query. That said, I actually do not think L1i and L1d are coherent, except for whatever SMC (self-modifying code) implementation exists, which I don't know the details of. L2 cache certainly doesn't need F states though.