Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?
Single-threaded memory bandwidth on modern CPUs is limited by max_concurrency / latency
of the transfers from L1D to the rest of the system, not by DRAM-controller bottlenecks. Each core has 10 Line-Fill Buffers (LFBs) which track outstanding requests to/from L1D. (And 16 "superqueue" entries which track lines to/from L2).
(Update: experiments show that Skylake probably has 12 LFBs, up from 10 in Broadwell. e.g. Fig7 in the ZombieLoad paper, and other performance experiments including @BeeOnRope's testing of multiple store streams)
Intel's many-core chips have higher latency to L3 / memory than quad-core or dual-core desktop / laptop chips, so single-threaded memory bandwidth is actually much worse on a big Xeon, even though the max aggregate bandwidth with many threads is much better. They have many more hops on the ring bus that connects cores, memory controllers, and the System Agent (PCIe and so on).
SKX (Skylake-server / AVX512, including the i9 "high-end desktop" chips) is really bad for this: L3 / memory latency is significantly higher than for Broadwell-E / Broadwell-EP, so single-threaded bandwidth is even worse than on a Broadwell with a similar core count. (SKX uses a mesh instead of a ring bus because that scales better, see this for details on both. But apparently the constant factors are bad in the new design; maybe future generations will have better L3 bandwidth/latency for small / medium core counts. The private per-core L2 is bumped up to 1MiB though, so maybe L3 is intentionally slow to save power.)
(Skylake-client (SKL) like in the question, and later quad/hex-core desktop/laptop chips like Kaby Lake and Coffee Lake, still use the simpler ring-bus layout. Only the server chips changed. We don't yet know for sure what Ice Lake client will do.)
A quad or dual core chip only needs a couple threads (especially if the cores + uncore (L3) are clocked high) to saturate its memory bandwidth, and a Skylake with fast DDR4 dual channel has quite a lot of bandwidth.
For more about this, see the Latency-bound Platforms section of this answer about x86 memory bandwidth. (And read the other parts for memcpy/memset with SIMD loops vs. rep movs/rep stos
, and NT stores vs. regular RFO stores, and more.)
Also related: What Every Programmer Should Know About Memory? (2017 update on what's still true and what's changed in that excellent article from 2007).
I finally got VTune (evalutation) up and running. It gives a DRAM bound score of .602 (between 0 and 1) on Broadwell-E and .324 on Skylake, with a huge part of the Broadwell-E delay coming from Memory Latency. Given that the memory sticks are the same speed (except dual-channel configured in Skylake and quad-channel in Broadwell-E), my best guess is that something about the memory controller in Skylake is just tremendously better.
It makes buying into the Broadwell-E architecture a much tougher call, and requires that you really need the extra cores to even consider it.
I also got L3/TLB miss counts. On Broadwell-E, TLB miss count was about 20% higher, and L3 miss count about 36% higher.
I don't think this is really an answer for "why" so I won't mark it as such, but is as close as I think I'll get to one for the time being. Thanks for all the helpful comments along the way.