What is the physical/electrical limit of audio quality?
What is the maximum physical/real/electronic bitsize in which a piece of audio can be stored?
As Dzarda comments, this is not a sensible question, and it is not clear what you mean by 'piece'. If you mean a single sample, you can store it in as many bits as your storage holds. Typical hard drives hold 1 TB or more, so a sample of 8 terabits would be within reach.
will 32bit audio be overkill/contain too much noise?
It is overkill in the same way that it makes no sense to protect your bike with a very heavy chain that is closed with a soft plastic padlock. You'd do better to spend less money on the chain and use the savings to buy a better padlock.
Let's say, for the sake of argument, that the signal/noise ratio of the analog parts of your audio system corresponds to 16 bits. Quantization noise halves with every extra bit, so digital sound stored as 18 bits carries noise at 1/4 of the analog level. Independent noise sources add as the root of the sum of their squares, so that raises the total noise by ~3% (from 100 to about 103, in arbitrary units). 20 bits will increase it by ~0.2%, and 32 bits by roughly 0.00000001%. That is, assuming you have a perfect translation from digital to analog.
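A rough sketch of that arithmetic, in case you want to try other bit depths. It assumes the quantization noise halves with each extra bit relative to the 16-bit analog floor, and it prints two ways of combining the noise: simply adding amplitudes (a pessimistic worst case) and root-sum-square, which is the usual rule for uncorrelated noise. The function name and the 16-bit default are just illustrative choices.

```python
import math

def noise_increase(bits, analog_bits=16):
    """Fractional increase in total noise when N-bit quantization noise is
    added to an analog noise floor equivalent to `analog_bits` bits."""
    # Quantization noise relative to the analog floor: halves per extra bit.
    ratio = 2.0 ** (analog_bits - bits)
    linear = ratio                            # worst case: amplitudes add directly
    rss = math.sqrt(1.0 + ratio ** 2) - 1.0   # uncorrelated noise: powers add
    return linear, rss

for bits in (16, 18, 20, 24, 32):
    linear, rss = noise_increase(bits)
    print(f"{bits:2d} bits: linear +{linear * 100:.3g}%, "
          f"root-sum-square +{rss * 100:.3g}%")
```

Either way you combine them, the extra noise is already small at 20 bits and negligible well before 32.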
The cost of storage increases linearly with bit size, while the cost of a full-range-accurate D/A converter rises almost exponentially as you approach a certain number of bits (~22?). So using more bits than the equivalent quality in the analog parts costs more, but the gain in quality diminishes. So it is simply not economical to use more bits: if you want to spend more money to get better quality, you should spend it on the analog parts. (I am not an audiophile, but AFAIK the speaker is often the weakest link.)
This is a common theme in engineering: it is not about making individual parts as good as possible, but about a balanced design.
Technology could allow you to store (almost) infinitely big (samples/sec) and infinitely deep (bits) data, and in fact lots of things do store this sort of thing: there are plenty of cameras that can record faster and in higher detail than human eyes can see, for example at 500 frames per second. Likewise there are scientific instruments such as seismometers which are (simplistically) a lot like microphones but far more sensitive than the human ear, and the recorded data is probably stored in more detail than a human could directly interpret if it was played back at real-world levels. However, these various devices are almost always used to capture things so we can analyse them in some other way: a wave on a graph, a slow-motion video, etc.
Going back to audio recording & playback, again there are scientific & test instruments which can sample, record, reproduce & generate far better quality (as in resolution/depth/accuracy) signals than humans can process, but there's not much point in having them in a recording studio.
Now, in a really good multi-track studio you might want better quality than humans can discern as you are adding lots of things together, so the less error you introduce the better it'll come out in the final mix. Simplistically again; if you do all the hard sums using 4 decimal places your final answer may only need to be to 1 decimal place but might still come out better as you won't have lost as much in rounding errors.
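To make the rounding point concrete, here is a toy sketch rather than anything resembling a real mixing engine. It mixes 32 made-up tracks of random integer samples with random gains, once rounding to a whole number after every addition and once keeping the fractions and rounding only at the end, then compares both against the unrounded sum. The track count, sample range, gains and seed are all arbitrary assumptions.

```python
import math
import random

def mix(tracks, gains, round_each_step):
    """Mix tracks sample by sample; optionally round after every addition."""
    out = []
    for samples in zip(*tracks):
        acc = 0.0
        for s, g in zip(samples, gains):
            acc += s * g
            if round_each_step:
                acc = float(round(acc))  # throw away the fraction at every stage
        out.append(round(acc))           # the final output is always an integer
    return out

def rms_error(result, reference):
    """Root-mean-square difference between two equal-length sequences."""
    total = sum((r - x) ** 2 for r, x in zip(result, reference))
    return math.sqrt(total / len(result))

rng = random.Random(0)  # fixed seed so the run is repeatable
n_tracks, n_samples = 32, 5000
tracks = [[rng.randint(-1000, 1000) for _ in range(n_samples)]
          for _ in range(n_tracks)]
gains = [rng.uniform(0.1, 1.0) for _ in range(n_tracks)]

# Reference: the floating-point mix, never rounded.
exact = [sum(s * g for s, g in zip(samples, gains)) for samples in zip(*tracks)]

coarse = mix(tracks, gains, round_each_step=True)
fine = mix(tracks, gains, round_each_step=False)
print("RMS error, rounding at every step:  ", rms_error(coarse, exact))
print("RMS error, rounding once at the end:", rms_error(fine, exact))
```

Rounding once can never be off by more than half a step, while rounding at every stage lets 32 small errors pile up, which is why the intermediate maths is worth doing at higher precision than the final result needs.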
In the final case (human consumption) there is only so much humans can discern so equipment is generally made to be good enough for that, because why would you do more work for no gain?
As an example: digital imaging has topped out at 8-bits-per-colour because the eye can't distinguish more than about 256 shades of grey / the total combination of 16.8 million colours & shades. We have 64-bit PCs and much better digital cameras these days, we could store 16 bits per colour, but people can't see 281,474,976,710,656 different colours and we'd waste a lot of effort capturing & storing that data.
Likewise, no-one will pay for a recording studio full of equipment that can hear, capture, record, and reproduce a fly farting at the back of the room over someone bashing a drumkit as no-one will ever hear it, even if it's there.
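The colour counts quoted in the imaging example are just powers of two, assuming three colour channels (red, green, blue) per pixel. A quick check, with a hypothetical helper name:

```python
def colour_count(bits_per_channel, channels=3):
    """Number of distinct colours for a given bit depth per channel."""
    return 2 ** (bits_per_channel * channels)

print(2 ** 8)                   # 256 shades per channel
print(f"{colour_count(8):,}")   # 16,777,216
print(f"{colour_count(16):,}")  # 281,474,976,710,656
```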
It's fun to play with some numbers. Let's assume 1 kΩ of source impedance. (You have to assume something.) So that's got ~4 nV/√Hz of Johnson noise. For a 10 kHz bandwidth, that's ~400 nV of noise. OK, and assume the signal is gained up to 5 volts and stored. That's about 10^7 in dynamic range... 23 bits. (In real life there will be more noise...)
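If you want to redo that estimate with other values, here is a minimal sketch of the same back-of-the-envelope sum. It uses the standard Johnson noise formula sqrt(4kTR) per root hertz and assumes a temperature of 300 K, which the numbers above do not state explicitly; the function name is just for illustration.

```python
import math

K_BOLTZMANN = 1.380649e-23  # Boltzmann constant, J/K

def johnson_noise_density(resistance_ohm, temperature_k=300.0):
    """Thermal noise voltage density of a resistor, in V/sqrt(Hz)."""
    return math.sqrt(4.0 * K_BOLTZMANN * temperature_k * resistance_ohm)

resistance = 1e3    # 1 kOhm source impedance
bandwidth = 10e3    # 10 kHz
full_scale = 5.0    # volts

density = johnson_noise_density(resistance)   # V/sqrt(Hz)
noise_rms = density * math.sqrt(bandwidth)    # V over the full bandwidth
dynamic_range = full_scale / noise_rms
bits = math.log2(dynamic_range)

print(f"noise density: {density * 1e9:.2f} nV/rtHz")
print(f"noise in band: {noise_rms * 1e9:.0f} nV")
print(f"dynamic range: {dynamic_range:.3g}, or about {bits:.1f} bits")
```

With these assumptions the result lands between 23 and 24 bits. Doubling the bandwidth or the resistance each cost half a bit, since the noise goes with the square root of both.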