Hardware debouncing of key matrix with minimum passive components
Let's be clear about this. If you have a keypad matrix, you are already using processing power to drive the rows (or columns) in sequence and then read back the columns (or rows) to determine which button is pressed.
So, each time you get a "result", i.e. you detect that a button has been pressed, you mark that event as "pending", and some time later (10 to 20 ms) you check again to see whether the button press you marked as "pending" can be judged to be "actual".
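To make the "pending" then "actual" idea concrete, here is a minimal sketch in C. It assumes your scan routine already produces a key code per scan and that you have a free-running millisecond counter; `NO_KEY`, `CONFIRM_DELAY_MS`, `debounce_t` and `debounce_update` are names made up for illustration, not from any particular library.

```c
#include <stdbool.h>
#include <stdint.h>

#define NO_KEY           0xFFu /* hypothetical code meaning "nothing pressed" */
#define CONFIRM_DELAY_MS 15u   /* assumed value inside the 10 to 20 ms window */

typedef struct {
    uint8_t  pending_key;   /* key seen on the last scan, not yet confirmed */
    uint32_t pending_since; /* time (ms) at which it was first seen */
    bool     reported;      /* true once this press has been reported */
} debounce_t;

void debounce_init(debounce_t *d)
{
    d->pending_key = NO_KEY;
    d->pending_since = 0u;
    d->reported = false;
}

/* Call after every matrix scan.
 * raw_key: result of the scan (NO_KEY if nothing is pressed).
 * now_ms:  free-running millisecond counter (wrap-around is tolerated).
 * Returns the key once when it becomes "actual", otherwise NO_KEY. */
uint8_t debounce_update(debounce_t *d, uint8_t raw_key, uint32_t now_ms)
{
    if (raw_key != d->pending_key) {
        /* Reading changed: mark the new reading as pending and restart. */
        d->pending_key = raw_key;
        d->pending_since = now_ms;
        d->reported = false;
        return NO_KEY;
    }
    if (raw_key != NO_KEY && !d->reported &&
        (uint32_t)(now_ms - d->pending_since) >= CONFIRM_DELAY_MS) {
        /* Same key still present after the delay: judge it "actual". */
        d->reported = true;
        return raw_key;
    }
    return NO_KEY;
}
```

The only per-scan cost is a couple of comparisons and a subtraction, which is the point being made here: the extra work on top of the scan itself is tiny.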
In my opinion, the extra processing time this needs is tiny in the bigger scheme of things, and if you are so close to the limit of what your CPU can handle, then get a bigger/faster CPU or increase the clock speed.
Using Rs and Cs can work, but in all cases it will produce a "slow" output that needs a Schmitt trigger to clean up the slow edge into a fast edge that is suitable for the logic that follows. You might get away with it, of course, but then you have a fixed solution with no flexibility.
Having said all the above, you might also need capacitors from each matrix line to ground to avoid ESD/EMC issues.
To get the best debounce performance, it would likely be best to have one R/C for each button. However, you should still get decent results with one per row/column. It just depends how critical it is, really. If you want to do it with the minimum number of components, why don't you try one per row/column first, then take some measurements and see if the result is good enough for your application?
If the results aren't what you wanted, then go ahead and add some on each button, then try again.
As a long-time embedded software engineer, I have to say that your assumption that debouncing will take some processing power out of your application is simply incorrect. This will never be true for any competently-written firmware.
Naturally, debouncing will require some processing. However, the processing is trivial, and for user input it will be happening at such a low update rate as to be utterly negligible. If you needed to debounce inputs with update rates in the tens of kHz, perhaps the processing for debouncing would be significant, but a human pressing buttons does not need anything like that kind of resolution. In your case, 100 Hz sampling would easily be fast enough, and you could almost certainly drop it as low as 10 Hz sampling without seriously affecting your user interaction.
If you're trying to do input processing in a main control loop running at tens of kHz, of course it'll suck processing power. The correct solution is to write firmware which does not do it that way though, not to use a hardware solution to fix a software anti-pattern. Appropriate use of timers and interrupt priorities will give you what you need.
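As a rough illustration of keeping the keypad out of the fast loop, here is a sketch that divides a periodic timer tick down to a 100 Hz "scan due" flag. It assumes a 1 kHz tick interrupt already exists on your part; `timer_tick_isr`, `keypad_scan_due`, `TICK_HZ` and `SCAN_HZ` are hypothetical names, and hooking the ISR up to a real timer is device-specific and not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICK_HZ        1000u /* assumed rate of the periodic timer interrupt */
#define SCAN_HZ        100u  /* keypad sampling rate suggested above */
#define TICKS_PER_SCAN (TICK_HZ / SCAN_HZ)

static volatile bool scan_due = false;

/* Call this from the (hypothetical) 1 kHz timer interrupt.
 * It only counts ticks and raises a flag, so the ISR stays short. */
void timer_tick_isr(void)
{
    static uint8_t divider = 0u;

    if (++divider >= TICKS_PER_SCAN) {
        divider = 0u;
        scan_due = true;
    }
}

/* Poll this from low-priority code. Returns true once per 10 ms period;
 * when it does, scan and debounce the keypad there, not in the fast loop.
 * If a tick lands between the test and the clear, at worst one scan
 * period is skipped, which is harmless for human input. */
bool keypad_scan_due(void)
{
    if (!scan_due) {
        return false;
    }
    scan_due = false;
    return true;
}
```

With this arrangement the fast control loop never touches the keypad; it costs one counter increment per tick plus one scan every 10 ms.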
You can optimise the processing by making sure the read-back is all on one I/O port. Assuming that you're setting levels on the columns and reading back the rows, you bit-AND, bit-shift and bit-OR to build up a 16-bit value for the 16 pins. XOR this with the previous 16-bit value, and if the result is non-zero then something changed. A simple debounce algorithm is then: if the pins changed state, set a counter to some value; if the pins kept their state and the counter is zero, accept that state; and if the counter is not zero, decrement it.
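A sketch of that counter scheme in C might look like the following. It assumes a 4x4 matrix where each column strobe gives a 4-bit row reading; `pack_rows`, `matrix_debounce`, `matrix_debounce_t` and `DEBOUNCE_TICKS` are illustrative names, and the actual port reads are left out because they depend on your hardware.

```c
#include <stdint.h>

#define DEBOUNCE_TICKS 3u /* assumed: scans the reading must stay unchanged */

typedef struct {
    uint16_t last_raw; /* raw 16-bit reading from the previous scan */
    uint16_t stable;   /* last accepted (debounced) state */
    uint8_t  counter;  /* scans left before the reading is trusted */
} matrix_debounce_t;

/* Build one 16-bit value from four 4-bit row readings, one per column
 * strobe: bit-AND to mask, bit-shift into place, bit-OR to combine. */
uint16_t pack_rows(const uint8_t row_bits[4])
{
    uint16_t value = 0u;

    for (int col = 0; col < 4; ++col) {
        value |= (uint16_t)((row_bits[col] & 0x0Fu) << (4 * col));
    }
    return value;
}

/* Call once per scan with the packed reading; returns the debounced state. */
uint16_t matrix_debounce(matrix_debounce_t *m, uint16_t raw)
{
    if ((raw ^ m->last_raw) != 0u) {
        /* Something changed: reload the counter and remember the reading. */
        m->counter = DEBOUNCE_TICKS;
        m->last_raw = raw;
    } else if (m->counter == 0u) {
        /* Unchanged and the counter has run out: accept this state. */
        m->stable = raw;
    } else {
        /* Unchanged but still settling: count down. */
        m->counter--;
    }
    return m->stable;
}
```

Start with the struct zeroed (`matrix_debounce_t m = {0, 0, 0};`). With `DEBOUNCE_TICKS` set to 3, a change is accepted on the fifth consecutive scan showing the same reading, so pick the value to suit your scan rate.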
You do need to check that only one button is pressed, of course. Some ARM cores have an instruction that reports how many bits are set, which is ideal for this, so check whether yours does. Just mentioning it as a further optimisation.
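For the single-button check, a portable sketch in C is below. `exactly_one_key` and `count_keys` are made-up names. `__builtin_popcount` is a GCC/Clang builtin; whether it compiles down to a single instruction depends on the target core, so the fallback loop is there for other compilers.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when exactly one bit is set in the 16-bit key state.
 * Clearing the lowest set bit with keys & (keys - 1) leaves zero
 * only if there was a single bit to begin with. */
bool exactly_one_key(uint16_t keys)
{
    return keys != 0u && (keys & (keys - 1u)) == 0u;
}

/* Number of keys currently pressed (number of set bits). */
unsigned count_keys(uint16_t keys)
{
#if defined(__GNUC__)
    /* GCC/Clang builtin; the compiler picks the best sequence it can. */
    return (unsigned)__builtin_popcount(keys);
#else
    /* Portable fallback: clear the lowest set bit until none remain. */
    unsigned count = 0u;
    unsigned remaining = keys;

    while (remaining != 0u) {
        remaining &= remaining - 1u;
        ++count;
    }
    return count;
#endif
}
```

If all you need is "one and only one key", `exactly_one_key` avoids counting altogether and works the same on any core.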