How to communicate faster than the system clock
I guess the confusion is that you assume you can only send one bit per clock cycle. There are lots of ways a communication scheme can encode more than one bit per symbol. A symbol is an abstract idea: the atom of transfer in a communication system.
It's really too big a topic to cover in any depth here, but imagine you weren't constrained to binary values and could instead send one of 1024 voltages as a symbol. In effect that would be 10 bits of information per symbol, and you would get 10x the "clock speed" in bandwidth. That's how the old NTSC video encodes data, for example.
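As a rough sketch (my own, not part of any particular scheme), here is the bits-per-symbol arithmetic in Python; the 100 Msymbol/s rate is just an arbitrary number picked for illustration:

```python
import math

# More voltage levels per symbol multiply the data rate at a fixed symbol rate.
SYMBOL_RATE = 100e6  # symbols per second (hypothetical link)

for levels in (2, 4, 1024):
    bits_per_symbol = math.log2(levels)          # 2 levels -> 1 bit, 1024 -> 10 bits
    rate_mbps = SYMBOL_RATE * bits_per_symbol / 1e6
    print(f"{levels:>4} levels: {bits_per_symbol:.0f} bits/symbol, {rate_mbps:.0f} Mbit/s")
```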
Another way of getting that kind of bandwidth is using buffering and then delegating the transport to specialized transmitters and receivers, sometimes called SERDES (serializer/deserializer) blocks. You can't sustain the throughput these are capable of without a source or sink that can keep up, but you can reduce the latency of transferring blocks of information between computing nodes this way. Look up phase-locked loops (PLLs): FPGAs and ASICs use them to derive faster clocks from a reference clock to do this sort of thing.
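Here's a toy software model of the serialize/deserialize idea, just to show the concept; real SERDES blocks also do things like 8b/10b encoding and clock recovery, which this sketch ignores:

```python
def serialize(data: bytes):
    """Emit the bits of each byte MSB-first, like a shift register clocking out."""
    for byte in data:
        for i in range(7, -1, -1):
            yield (byte >> i) & 1

def deserialize(bits):
    """Collect incoming bits back into bytes, like a shift register clocking in."""
    out, acc, count = bytearray(), 0, 0
    for bit in bits:
        acc = (acc << 1) | bit
        count += 1
        if count == 8:
            out.append(acc)
            acc, count = 0, 0
    return bytes(out)

payload = b"block of data"
assert deserialize(serialize(payload)) == payload
```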
Another way is simply having lots of parallel channels to transmit the data on. Think of the old parallel ports on PCs: in a single clock you transfer a whole bunch of bits, each bit on its own dedicated wire. USB-C and its kin have a lot more data pins than one RX and one TX, for example.
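A trivial sketch of why parallel lanes multiply throughput; the lane counts and per-lane rates below are illustrative assumptions, not the spec of any particular interface:

```python
# Total bandwidth is simply lanes x per-lane rate.
def aggregate_gbps(lanes: int, per_lane_gbps: float) -> float:
    return lanes * per_lane_gbps

print(aggregate_gbps(8, 0.002))   # 8 parallel wires at ~2 Mbit/s each -> 16 Mbit/s
print(aggregate_gbps(2, 20.0))    # 2 fast serial lanes at 20 Gbit/s each -> 40 Gbit/s
```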
Bandwidth is an aggregate property: the net effect of all kinds of techniques like these, many of which can be used together.
There is no direct correlation between processor speed and peripheral speed.
It is not 10×. Thunderbolt 3: \$\frac{40\ \text{Gbit/s}}{8\ \text{bits/byte}} = 5\ \text{Gbyte/s}\$. This rate does not even seem unrealistic for a 4.2 GHz 64-bit processor.
But that is not what we are dealing with here. We have a peripheral with serial communication plus a graphics card: four times the data and twice the video bandwidth of existing capabilities. As the link says, desktop performance from a laptop. One port to link them all and in the darkness bind them.
From Thunderbolt 3 – The USB-C That Does It All
Users have long wanted desktop-level performance from a mobile computer. Thunderbolt was developed to simultaneously support the fastest data and most video bandwidth available on a single cable, while also supplying power. Then recently the USB group introduced the USB-C connector, which is small, reversible, fast, supplies power, and allows other I/O in addition to USB to run on it, maximizing its potential. So in the biggest advancement since its inception, Thunderbolt 3 brings Thunderbolt to USB-C at 40Gbps, fulfilling its promise, creating one compact port that does it all.
There are multiple clocks within a CPU; peripherals can run much faster than the listed CPU speed, either by running off a faster clock or by implementing parallel communication.
No and no and no.
There are multiple clocks in a computer. Inside a CPU there is one clock. Peripherals can either derive their clock rate from the system clock (slower) or use a crystal to make their own clock.
Parallel communications have gone the way of the dodo; they were limited to short distances. USB, I2C, I2S, CAN, etc. are all serial protocols.
Your 4.2 GHz processor does not communicate at 4.2 GHz. That's the clock rate; a better indication of performance is MIPS. And that measures program instructions, not external communication.
You cannot equate a 64-bit processor running at 4.2 GHz with a peripheral running serially at 20 GHz. The 20 GHz clock is not derived from the 4.2 GHz one. At 20 GHz, the signal is more analog than digital.
Now, bit-banging, properly designed, the 4.2 GHz 64-bit processor could probably do 20 Gbit/s serially (2.5 Gbyte/s), but that's not its purpose.
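For what it's worth, here's a minimal sketch of what bit-banging means: the CPU itself shifts bits out of one output line, one write per bit. `set_pin` is a hypothetical stand-in for a GPIO register write; the snippet only shows the idea, not the speed:

```python
def bitbang_byte(byte: int, set_pin) -> None:
    """Drive one output line directly from software, MSB first."""
    for i in range(7, -1, -1):       # one pin write per bit
        set_pin((byte >> i) & 1)

captured = []
bitbang_byte(0xA5, captured.append)  # capture instead of driving real hardware
print(captured)                      # [1, 0, 1, 0, 0, 1, 0, 1]
```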
The other answers have focused more on the Thunderbolt side of things, but let’s look back at the statement
Intel's latest i9 processor speed... about 4.2GHz max
4.2 GHz is the system clock, which (in a very, very simplistic way) is comparable to the number of instructions per second, per core (it’s really a lot more complex than that, as not all instructions take the same time to execute, there are wait times, etc.).
But on each cycle, the CPU will process data (to/from registers, caches, RAM and possibly I/O, from fastest to slowest). In the meantime, other peripherals can also read from/write to RAM without the CPU being involved (that’s called DMA).
The main bottleneck is then often RAM. It needs to be fast enough to feed the CPU as needed (with instructions to run and data to process), handle DMA, and in some cases it is shared with a GPU or acts as a frame buffer for video (some component then reads that frame buffer and sends it to a video output on each frame; for Full HD 1920 x 1080 at 60 Hz with 24-bit colour, that's nearly 3 Gbit/s, and for 4K@60 fps, 4 times that).
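Working out those figures (raw, uncompressed frames; the helper function is just my own illustration):

```python
# Raw video bandwidth: pixels per frame x bits per pixel x frames per second.
def video_gbps(width: int, height: int, bits_per_pixel: int, fps: int) -> float:
    return width * height * bits_per_pixel * fps / 1e9

print(video_gbps(1920, 1080, 24, 60))   # ~2.99 Gbit/s for Full HD at 60 Hz
print(video_gbps(3840, 2160, 24, 60))   # ~11.9 Gbit/s for 4K at 60 fps (4x the pixels)
```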
RAM uses wide buses, usually 32 or 64 bits wide, and there may be several separate channels. The fastest RAM currently seems to be DDR4-3200, which allows 3200 million 64-bit transfers per second. That's 25600 Mbytes/s or 204800 Mbits/s (over 200 Gbit/s) per channel.
An i9-9980XE CPU can have 4 memory channels. That means RAM could support over 800 Gbits/s, so 40 Gbps is just peanuts compared to that.
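The same arithmetic in code form (DDR4-3200 on a 64-bit bus, 4 channels as above):

```python
# RAM bandwidth: transfers/s x bus width, then x number of channels.
transfers_per_second = 3200e6   # DDR4-3200: 3200 million transfers/s
bus_width_bytes = 8             # 64-bit bus
channels = 4                    # e.g. an i9-9980XE platform

per_channel_gbits = transfers_per_second * bus_width_bytes * 8 / 1e9
total_gbits = per_channel_gbits * channels
print(per_channel_gbits)   # 204.8 Gbit/s per channel
print(total_gbits)         # ~819 Gbit/s total, next to which 40 Gbit/s is peanuts
```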
The impressive part about Thunderbolt is getting that speed over longer distances (not a few cm on a motherboard), over a relatively simple cable (not hundreds of connectors, as required to support multiple 64-bit RAM buses).