How are beats formed when frequencies combine?
Beats can be thought of as the next level of complication beyond simple constructive and destructive interference. The best way to see this is to visualize what actually happens when we sum two sine waves of different frequencies:
There's no magic going on here; this is just straight-up addition.
What is happening is that sometimes the two signals are constructively interfering, and sometimes they are destructively interfering. The rate at which they go back and forth between constructive and destructive is equal to the difference in frequencies, and is called the "beat frequency." You can see that there is still a high-frequency sine wave there, so you still hear the "correct" note (its frequency is the average of the two frequencies), but you also hear what we call an "envelope," making that high-frequency tone go louder and softer. Those are the beats.
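To make the "louder and softer" picture concrete, here is a minimal sketch that adds the two sine waves sample by sample and measures the peak level in short windows. The 512 Hz and 516 Hz tones are the ones from the question; the sample rate, the window width, and the helper names `summed_signal` and `peak_in_window` are illustrative assumptions, not part of the original answer.

```python
import math

F1, F2 = 512.0, 516.0     # the two tones from the question, in Hz
SAMPLE_RATE = 16_000      # samples per second (illustrative choice)

def summed_signal(t):
    """Plain addition of two unit-amplitude sine waves."""
    return math.sin(2 * math.pi * F1 * t) + math.sin(2 * math.pi * F2 * t)

def peak_in_window(t_start, width=0.01):
    """Largest |signal| within a short window: a crude loudness measure."""
    n = int(width * SAMPLE_RATE)
    return max(abs(summed_signal(t_start + k / SAMPLE_RATE)) for k in range(n))

beat_frequency = abs(F1 - F2)      # 4 Hz
beat_period = 1 / beat_frequency   # 0.25 s

# Loud near t = 0 (in phase), quiet half a beat period later (out of phase),
# loud again after one full beat period.
loud = peak_in_window(0.0)
quiet = peak_in_window(beat_period / 2 - 0.005)
loud_again = peak_in_window(beat_period - 0.005)
print(f"loud={loud:.2f} quiet={quiet:.2f} loud again={loud_again:.2f}")
```

The peak level should come out close to 2 in the loud windows and close to 0 in the quiet one, repeating every 0.25 s.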
As an alternative explanation, this beat interference is similar to the Moiré pattern that you get, for instance, if you overlay two sets of line gratings with different spacings.
https://upload.wikimedia.org/wikipedia/commons/2/26/Moire1_95.png
In this picture the two line gratings would correspond to the two frequencies (512 Hz, 516 Hz) from your speakers and the dark Moiré pattern which has bigger spacing (=lower frequency) would correspond to your beat frequency (4 Hz).
Well, imagine that you superpose two signals (e.g. using two speakers, one emitting a signal at $f_1$ and another at $f_2$). Imagine that these signals are in phase at $t = 0$. If they have very different frequencies, they will oscillate at very different speeds, and if you add their waveforms just as in the picture, the sum will look irregular, with no obvious pattern.
If, however, the frequencies are quite similar ($|f_1-f_2|$ is small), then at the beginning the signals will remain approximately in phase for some time and add constructively. One of the signals slowly drifts relative to the other, though, and at some point they end up in antiphase and add destructively. This is easy to see mathematically using the identity $\cos(x) + \cos(y) = 2\cos\left(\frac{x-y}{2}\right)\cos\left(\frac{x+y}{2}\right)$. With $x = 2\pi f_1 t$ and $y = 2\pi f_2 t$ (two unit-amplitude signals), the output of the two-speaker device is: \begin{equation} S(t) = 2\cos\left(2\pi \frac{f_1-f_2}{2} t\right)\cos\left(2\pi\frac{f_1+f_2}{2} t\right) \end{equation}
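As a quick sanity check of that product form, the sketch below compares the direct sum of the two cosines with the envelope-times-carrier expression at many sample times. The function names `two_speakers` and `envelope_times_carrier`, the chosen frequencies, and the sampling grid are assumptions made for illustration only.

```python
import math

def two_speakers(f1, f2, t):
    """Direct sum of two unit-amplitude cosines that are in phase at t = 0."""
    return math.cos(2 * math.pi * f1 * t) + math.cos(2 * math.pi * f2 * t)

def envelope_times_carrier(f1, f2, t):
    """Product form S(t): slow envelope times fast carrier."""
    envelope = 2 * math.cos(2 * math.pi * (f1 - f2) / 2 * t)
    carrier = math.cos(2 * math.pi * (f1 + f2) / 2 * t)
    return envelope * carrier

f1, f2 = 512.0, 516.0
times = [k / 10_000 for k in range(5_000)]  # 0.5 s sampled every 0.1 ms
max_difference = max(
    abs(two_speakers(f1, f2, t) - envelope_times_carrier(f1, f2, t)) for t in times
)
print(max_difference)  # tiny: only floating-point rounding
```

The two expressions should agree up to rounding error, which is just the trigonometric identity at work.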
The shape of this signal is the following (blue):
The envelope comes from the slowly varying cosine factor. That factor oscillates at $|f_1-f_2|/2$, but the loudness depends on its magnitude, which peaks twice per cycle, so the beat frequency is $|f_1-f_2|$. In your case, it has to be 4 Hz, so: \begin{equation} f_2 = f_1 \pm 4 \mbox{ Hz} \end{equation}
So, if $f_1$ is 512 Hz, then $f_2$ is either 508 Hz or 516 Hz. The spectrum is still just two peaks at $f_1$ and $f_2$, and you are right to say that they are the frequencies heard, but the envelope of the signal is periodic with a frequency of 4 Hz.
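The claim about the spectrum can be checked with a small discrete Fourier transform. This is only a sketch: the sample rate, the 1 s duration (chosen so that each DFT bin sits on a whole number of hertz), and the helper `dft_magnitude` are assumptions for illustration, and it takes the 516 Hz option for $f_2$.

```python
import cmath
import math

F1, F2 = 512.0, 516.0
SAMPLE_RATE = 2048   # Hz, above twice the highest tone (illustrative choice)
DURATION = 1.0       # seconds, so DFT bin k sits at exactly k Hz

n_samples = int(SAMPLE_RATE * DURATION)
samples = [
    math.cos(2 * math.pi * F1 * k / SAMPLE_RATE)
    + math.cos(2 * math.pi * F2 * k / SAMPLE_RATE)
    for k in range(n_samples)
]

def dft_magnitude(freq_hz):
    """Normalized magnitude of one DFT bin, computed directly."""
    total = sum(
        x * cmath.exp(-2j * math.pi * freq_hz * k / SAMPLE_RATE)
        for k, x in enumerate(samples)
    )
    return abs(total) / n_samples

for freq in (4, 508, 512, 514, 516):
    print(f"{freq:>4} Hz: {dft_magnitude(freq):.3f}")
```

Under these assumptions the magnitude should be about 0.5 at 512 Hz and 516 Hz and essentially zero at 4 Hz, 508 Hz and 514 Hz: the beat shows up in the envelope of the waveform, not as a separate 4 Hz spectral line.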