Why batch normalization over channels only in CNN
I was puzzled by this for a few hours, because it didn't seem to make sense to normalize over the channel axis: every channel in a conv net is considered a different "feature", so normalizing across all channels together would be like normalizing the number of bedrooms together with the size in square feet (the multivariate regression example from Andrew Ng's ML course). That is not what normalization does; you normalize each feature by itself. I.e. you normalize the number of bedrooms across all examples to mu=0 and std=1, and you normalize the square feet across all examples to mu=0 and std=1.
After checking and testing it myself, I realized what the issue is: there is some confusion/misconception here. The axis you specify in Keras is actually the axis that is *excluded* from the calculations, i.e. you average over every axis except the one given by this argument. This is confusing, because it is exactly the opposite of how NumPy works, where the specified axis is the one you perform the operation over (e.g. np.mean, np.std, etc.). EDIT: check this answer here.
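To illustrate the difference, here is a minimal NumPy sketch (the [m, n_H, n_W, n_C] shape is assumed for illustration):

```python
import numpy as np

X = np.random.randn(8, 5, 5, 3)  # assumed shape: [m, n_H, n_W, n_C]

# NumPy: the axis you pass is the one the operation reduces over.
print(np.mean(X, axis=-1).shape)  # (8, 5, 5): one mean per position

# Keras BatchNormalization(axis=-1) does the opposite: the given axis is
# the one that is KEPT, and statistics are taken over all other axes.
print(np.mean(X, axis=(0, 1, 2)).shape)  # (3,): one mean per channel
```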
To confirm, I built a toy model with only a BN layer and then computed the normalization manually: I took the mean and std across the first three dimensions [m, n_W, n_H], which gives n_C values, computed (X - mu) / std (using broadcasting), and got results identical to the Keras output.
So I'm pretty sure about this.
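For reference, here is a minimal sketch of that verification, assuming TensorFlow 2.x / tf.keras (the shapes and variable names are mine):

```python
import numpy as np
import tensorflow as tf

m, n_H, n_W, n_C = 8, 5, 5, 3
X = np.random.randn(m, n_H, n_W, n_C).astype("float32")

# A "model" consisting of a single BN layer; axis=-1 (the default) keeps
# the channel axis and computes statistics over all the other axes.
bn = tf.keras.layers.BatchNormalization(axis=-1, epsilon=1e-3)

# training=True makes the layer use the current batch statistics instead
# of its (still-uninitialized) moving averages.
keras_out = bn(X, training=True).numpy()

# Manual BN: mean/variance across the first three dimensions [m, n_H, n_W],
# which yields n_C values, then broadcast over X.
mu = X.mean(axis=(0, 1, 2))
var = X.var(axis=(0, 1, 2))
manual_out = (X - mu) / np.sqrt(var + 1e-3)  # gamma=1, beta=0 at init

print(np.allclose(keras_out, manual_out, atol=1e-4))  # True
```

Note that Keras actually divides by sqrt(var + epsilon) rather than the raw std, so the manual calculation has to include the layer's epsilon (1e-3 by default) to match exactly.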
In CNNs for images, normalization per channel is helpful because the weights of a filter are shared across spatial locations. The figure below, taken from another paper, shows which axes the statistics are computed over in BN and related methods; it helps build intuition.
[Figure: comparison of normalization methods, taken from Wu, Y. and He, K., 2018. Group Normalization. arXiv preprint arXiv:1803.08494.]
As far as I know, in feed-forward (dense) layers one applies batch normalization to each unit (neuron) separately, because each of them has its own weights. Therefore, you normalize per feature, i.e. the axis parameter points at the feature axis.
But in convolutional layers, the weights are shared across spatial locations, i.e. each filter applies the same transformation at every position of the input volume. Therefore, you apply batch normalization using one mean and one variance per feature map, NOT per unit/neuron.
That is why, I guess, there is a difference in the axis parameter value between the two cases.
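To make the contrast concrete, a small sketch (the activation shapes are assumed for illustration):

```python
import numpy as np

dense_act = np.random.randn(32, 100)      # [batch, units]
conv_act = np.random.randn(32, 7, 7, 64)  # [batch, height, width, channels]

# Dense layer: each unit has its own weights, so one statistic per unit,
# computed over the batch axis only.
print(dense_act.mean(axis=0).shape)  # (100,)

# Conv layer: the same filter is applied at every spatial position, so one
# statistic per feature map, computed over batch AND spatial axes.
print(conv_act.mean(axis=(0, 1, 2)).shape)  # (64,)
```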