SPI Bus Termination Considerations
Talking about signal termination is like opening a can of worms. This is a HUGE subject that is difficult to summarize in just a couple hundred words. Therefore, I won't. I am going to leave a huge amount of stuff out of this answer. But I will also give you a big warning: There is much misinformation about terminating resistors on the net. In fact, I would say that most of what's found on the net is wrong or misleading. Some day I'll write up something big and post it to my blog, but not today.
The first thing to note is that the resistor value to use for your termination must be related to your trace impedance. Most of the time the resistor value is the same as your trace impedance. If you don't know what the trace impedance is then you should figure it out. There are many online impedance calculators available. A Google search will bring up dozens more.
Most PCB traces have an impedance from 40 to 120 ohms, which is why you found that a 1k termination resistor did almost nothing and a 100-ish ohm resistor was much better.
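To put rough numbers on that, here is a minimal sketch using the standard reflection coefficient for a resistive end termination, (R - Z0) / (R + Z0). The 100 ohm trace impedance is an assumed example value, not a measurement of your board.

```python
def reflection_coefficient(r_term, z0):
    """Fraction of the incident voltage wave reflected by a resistive end termination."""
    return (r_term - z0) / (r_term + z0)


Z0 = 100.0  # assumed trace impedance in ohms (example value only)

for r in (1000.0, 100.0):
    gamma = reflection_coefficient(r, Z0)
    print(f"{r:6.0f} ohm termination reflects {gamma * 100:5.1f}% of the incoming edge")
```

With these assumed values the 1k resistor still reflects most of the edge, while a resistor equal to the trace impedance reflects none of it.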
There are many types of termination, but we can roughly put them into two categories: Source and End termination. Source termination is at the driver, end termination is at the far end. Within each category, there are many types of termination. Each type is best for different uses, with no one type good for everything.
Your termination, a single resistor to ground at the far end, is actually not very good. In fact, it's wrong. People do it, but it isn't ideal. Ideally that resistor would go to a separate power rail at half of your I/O voltage. So if the I/O voltage is 3.3 V then that resistor will not go to GND, but to another power rail at half of 3.3 V (i.e., 1.65 V). The voltage regulator for this rail has to be special because it needs to source AND sink current, where most regulators only source current. Regulators that work for this use will mention something about termination on the first page of the datasheet.
The big problem with most end termination schemes is that they consume lots of current. There is a reason for this, but I won't go into it. For low-current use we must look at source termination. The easiest and most common form of source termination is a simple series resistor at the output of the driver. The value of this resistor is the same as the trace impedance.
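A quick DC sketch shows why the half-rail regulator has to both source and sink current. It assumes an ideal driver with zero output resistance that swings fully between 0 V and the I/O voltage, and a 50 ohm termination resistor; both are illustrative assumptions.

```python
def termination_current_ma(v_driver, v_term_rail, r_term):
    """DC current in mA flowing from the driver into the termination resistor.

    Positive: the driver sources current and the termination rail sinks it.
    Negative: the termination rail sources current and the driver sinks it.
    """
    return (v_driver - v_term_rail) / r_term * 1000.0


V_IO = 3.3      # I/O voltage from the example above
R_TERM = 50.0   # assumed termination resistor in ohms

for rail_name, v_rail in (("GND", 0.0), ("V_IO/2", V_IO / 2)):
    i_high = termination_current_ma(V_IO, v_rail, R_TERM)
    i_low = termination_current_ma(0.0, v_rail, R_TERM)
    print(f"resistor to {rail_name:7s}: line high {i_high:+6.1f} mA, line low {i_low:+6.1f} mA")
```

With the resistor tied to the half rail, the current changes sign between the high and low states, so the rail must absorb current in one state and supply it in the other.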
Source termination works differently than end termination, but the net effect is the same. It works by controlling signal reflections, not preventing the reflections in the first place. Because of this, it only works if a driver output is feeding a single load. If there are multiple loads then something else should be done (like using end termination or multiple source termination resistors). The huge benefit of source termination is that it does not load down your driver like end termination does.
I said before that your series resistor for source termination must be located at the driver, and it must have the same value as your trace impedance. That was an oversimplification. There is one important detail to know about this. Most drivers have some resistance on their outputs. That resistance is usually in the 10-30 ohm range. The sum of the output resistance and your resistor must equal your trace impedance. Let's say that your trace is 50 ohms, and your driver has 20 ohms. In this case your resistor would be 30 ohms since 30+20=50. If the datasheets do not say what the output impedance/resistance of the driver is then you can assume it to be 20 ohms, then look at the signals on the PCB and see if the resistor needs to be adjusted.
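As a small sketch of that arithmetic (the function name and the 20 ohm default are only illustrative, matching the fallback guess above):

```python
def series_resistor(z0, r_driver=20.0):
    """Series resistor so that driver output resistance plus resistor equals z0.

    r_driver defaults to the 20 ohm guess used when the datasheet is silent.
    Returns 0 if the driver alone already meets or exceeds the trace impedance.
    """
    return max(z0 - r_driver, 0.0)


print(series_resistor(50.0, 20.0))   # the 50 ohm trace, 20 ohm driver example
print(series_resistor(100.0, 30.0))  # a higher-impedance trace with a weaker driver
```

Treat the result as a starting value and adjust it after looking at the real signal at the receiver.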
Another important thing: when you look at these signals on an o-scope you MUST probe at the receiver. Probing anywhere else will likely give you a distorted waveform and trick you into thinking that things are worse than they really are. Also, make sure that your ground clip is as short as possible.
Conclusion: Switch to source termination with a 33 to 50 ohm resistor and you should be fine. The usual caveats apply.
Since you are going short distances, I don't think termination resistors are a good idea. As you found, they have to be quite low to do the job, and then the line draws a lot of current and the voltage is attenuated by a factor of 2 if you also drive the line with the same impedance.
Your clock rate isn't all that high, so the frequencies you need to support even a 4 MHz bit rate aren't the ones causing the trouble. The problem is that you have fast edges driving the lines, and those edges have harmonics in the 100s of MHz, which do cause the trouble. At your desired frequencies, you have a lumped system, not a transmission line. This makes things considerably easier.
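One rough way to see the difference is to compare wavelengths. The sketch below assumes a propagation velocity of 1.5e8 m/s (about half the speed of light, a common rule of thumb for PCB traces and cables) and picks 300 MHz as a stand-in for "100s of MHz"; both numbers are assumptions for illustration.

```python
V_PROP = 1.5e8  # assumed propagation velocity in m/s (rule-of-thumb value)


def wavelength_m(freq_hz, v_prop=V_PROP):
    """Wavelength in metres of a signal at freq_hz travelling at v_prop."""
    return v_prop / freq_hz


for label, f in (("4 MHz bit rate", 4e6), ("300 MHz edge harmonic", 300e6)):
    print(f"{label:22s}: wavelength about {wavelength_m(f):6.2f} m")
```

A short bus is a tiny fraction of the wavelength at the bit rate, but can be a noticeable fraction of it at the edge harmonics.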
The solution therefore is to attenuate the high frequencies that you don't really need but cause the trouble. This can be done with a simple R-C low pass filter immediately after anything that drives a line. This is in part what the 330 Ω resistors are doing now. They form a low pass filter with the parasitic capacitance of the line. Apparently that is not quite enough and/or is not predictable enough. This can be fixed with some deliberate capacitance on each line.
You want to run the bus at 4 MHz, which means the fastest signal it needs to support is a 4 MHz square wave. That means the length of each level is 125 ns. Let's say we want that to be at least 4 time constants, which implies 98% settling. That means the maximum time constant we want to allow is 31 ns. 31 ns / 330 Ω = 94 pF. That is the total load on the 330 Ω series resistors you need to get the 31 ns time constant. There will always be some parasitic capacitance you can't predict, so I'd see how things look with 47 pF. That leaves room for 10-20 pF of hidden capacitance while not exceeding our maximum allowed time constant.
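The same calculation as a short sketch, so the numbers are easy to redo for a different bit rate or resistor. The function names are illustrative, and the 20 pF stray capacitance is the upper end of the guess above, not a measured value.

```python
import math


def max_load_capacitance(bit_rate_hz, r_series, n_tau=4):
    """Largest total capacitance (farads) that still gives n_tau time constants per level."""
    level_time = 1.0 / (2.0 * bit_rate_hz)  # one high or low level of the square wave
    tau_max = level_time / n_tau
    return tau_max / r_series


def settled_fraction(n_tau):
    """Fraction of the final value reached by an RC step response after n_tau time constants."""
    return 1.0 - math.exp(-n_tau)


R_SERIES = 330.0
c_max = max_load_capacitance(4e6, R_SERIES)
tau_with_stray = R_SERIES * (47e-12 + 20e-12)  # 47 pF part plus an assumed 20 pF of stray

print(f"maximum total capacitance: {c_max * 1e12:.1f} pF")
print(f"settling after 4 time constants: {settled_fraction(4) * 100:.1f}%")
print(f"time constant with 47 pF + 20 pF stray: {tau_with_stray * 1e9:.1f} ns")
```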
The series resistors should be as close as possible to all pins that drive the bus. This assumes that all other pins on the bus will be CMOS inputs when one is driving. For lines that are only ever driven by a single pin (like the clock line, which is only driven by the master), put the 47 pF as close as possible after the resistor. For lines that can be driven by different pins at different times (like MISO), put the 47 pF somewhere near the middle of all the drivers. Each line gets only a single 47 pF capacitor no matter how many drivers, but there is one resistor for each driver.
The calculations above are meant to be a good guide to start with. Some parameters can't be known and therefore accounted for up front. Start with the 330 Ω in series and 47 pF to ground, but don't be afraid to change things based on real observed results.
In the absence of any termination, when a signal is sent from a very-low-impedance source to a very-high-impedance receiver, the signal will bounce back and forth repeatedly; the phase of the signal will be flipped 180 degrees on each round trip.
If one does not wish to have signals reflected when they hit the destination, one may use end termination. This will cause the signal to be cleanly absorbed at the destination without being reflected, but many common implementations will cause the source to see a significant DC load.
In many cases, one may achieve results that are just as practically useful if one instead inserts a series resistor at the signal source. If there is no receiver at the far end of the line, the signal will be reflected when it gets there, but any such reflection will be absorbed by the source rather than re-reflected. Note also that source termination does not impose a DC load on the device driving the line.
In the absence of termination, if a line is driven by a low impedance and received with a high impedance, the receiving device may see a voltage higher than the driving voltage (in theory, up to twice the voltage, if the source driving impedance is zero and the receiving impedance is infinite). If either the source or the receiver is properly terminated, the received voltage will be nearly equal to the drive voltage (if a zero-impedance source drives a properly terminated receiver, or a properly terminated source drives an infinite-impedance receiver, the received voltage will equal the drive voltage). If both are properly terminated, the received voltage will be half the drive voltage.
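These four cases can be checked with a short sketch of the voltage seen at the receiver when the edge first arrives. It assumes an ideal lossless line and purely resistive source and load; the 50 ohm line impedance is an arbitrary example value.

```python
import math


def first_arrival_ratio(r_source, r_load, z0):
    """Receiver voltage at first arrival of the edge, as a multiple of the drive voltage."""
    launched = z0 / (r_source + z0)  # divider between source resistance and the line
    if math.isinf(r_load):
        gamma_load = 1.0             # open end reflects the whole wave
    else:
        gamma_load = (r_load - z0) / (r_load + z0)
    return launched * (1.0 + gamma_load)


Z0 = 50.0  # assumed line impedance in ohms

cases = {
    "no termination (0 ohm source, open end)": (0.0, math.inf),
    "source terminated only": (Z0, math.inf),
    "end terminated only": (0.0, Z0),
    "both terminated": (Z0, Z0),
}
for name, (rs, rl) in cases.items():
    print(f"{name:42s}: {first_arrival_ratio(rs, rl, Z0):.2f} x drive voltage")
```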
The simulation here demonstrates this. It includes a pulse generator which produces a chain of pulses roughly 49 times per second, two 5ms delay lines in series (round-trip time 1/50 second), and switchable termination resistors at both ends.
The circuit includes three SPDT switches; click on one to change its state. The lower two switches control source and destination termination. For those, "up" represents good termination and "down" represents bad. The upper switch controls whether the line should be driven by an automatic pulse generator, or by a manual logic input. To send pulses down the line manually, switch the upper switch "down", and then click the "L" next to it.
The signals reaching the destination will be clean if either the source or the destination is properly terminated. If both are properly terminated, the received signal voltage will be half of the drive voltage. If one is properly terminated but the other is not, the received voltage will be about 91% of the drive voltage (the "bad" resistors are "wrong" by a factor of ten, and thus fail to absorb about (10/11) of the energy). If neither is terminated, the received voltage will initially be about 1.656 times the drive voltage, but weird reflections will appear every 20ms.
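For anyone who wants to reproduce the trend without the simulator, here is a minimal bounce-diagram sketch. It assumes a lossless line driven by a single voltage step (not a pulse train), and it assumes the "bad" resistors are one tenth of the line impedance at the source and ten times the line impedance at the destination; the simulation's actual values may differ slightly.

```python
def receiver_voltage_steps(r_source, r_load, z0, n_arrivals=6, v_drive=1.0):
    """Receiver voltage after each successive arrival of the wavefront (one per round trip)."""
    gamma_s = (r_source - z0) / (r_source + z0)  # reflection coefficient at the source
    gamma_l = (r_load - z0) / (r_load + z0)      # reflection coefficient at the load
    incident = v_drive * z0 / (r_source + z0)    # wave initially launched into the line
    v_rx = 0.0
    steps = []
    for _ in range(n_arrivals):
        v_rx += incident * (1.0 + gamma_l)       # incident wave plus its reflection at the load
        steps.append(v_rx)
        incident *= gamma_l * gamma_s            # wave that returns after one more round trip
    return steps


Z0 = 50.0
GOOD, BAD_SOURCE, BAD_LOAD = Z0, Z0 / 10.0, Z0 * 10.0  # assumed resistor values

for name, rs, rl in (
    ("both good", GOOD, GOOD),
    ("source good, destination bad", GOOD, BAD_LOAD),
    ("source bad, destination good", BAD_SOURCE, GOOD),
    ("both bad", BAD_SOURCE, BAD_LOAD),
):
    steps = receiver_voltage_steps(rs, rl, Z0, n_arrivals=4)
    print(f"{name:30s}: " + ", ".join(f"{v:.3f}" for v in steps))
```

Under these assumptions the first arrival with both resistors "bad" comes out near 1.65 times the drive voltage, close to the figure quoted above, and each later entry corresponds to one more 20 ms round trip.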