Where to apply batch normalization on standard CNNs

The original batch-norm paper prescribes using the batch-norm before ReLU activation. But there is evidence that it's probably better to use batchnorm after the activation. Here's a comment on Keras GitHub by Francois Chollet:

... I can guarantee that recent code written by Christian [Szegedy] applies relu before BN. It is still occasionally a topic of debate, though.

To your second question: in tensorflow, you can use a high-level tf.layers.batch_normalization function, or a low-level tf.nn.batch_normalization.


In addition to the original paper using batch normalization before the activation, Bengio's book Deep Learning, section 8.7.1 gives some reasoning for why applying batch normalization after the activation (or directly before the input to the next layer) may cause some issues:

It is natural to wonder whether we should apply batch normalization to the input X, or to the transformed value XW+b. Ioffe and Szegedy (2015) recommend the latter. More specifically, XW+b should be replaced by a normalized version of XW. The bias term should be omitted because it becomes redundant with the β parameter applied by the batch normalization reparameterization. The input to a layer is usually the output of a nonlinear activation function such as the rectified linear function in a previous layer. The statistics of the input are thus more non-Gaussian and less amenable to standardization by linear operations.

In other words, if we use a relu activation, all negative values are mapped to zero. This will likely result in a mean value that is already very close to zero, but the distribution of the remaining data will be heavily skewed to the right. Trying to normalize that data to a nice bell-shaped curve probably won't give the best results. For activations outside of the relu family this may not be as big of an issue.

Some report better results when placing batch normalization after activation, while others get better results with batch normalization before activation. It's an open debate. I suggest that you test your model using both configurations, and if batch normalization after activation gives a significant decrease in validation loss, use that configuration instead.


There's some debate on this question. This Stack Overflow thread and this keras thread are examples of the debate. Andrew Ng says that batch normalization should be applied immediately before the non-linearity of the current layer. The authors of the BN paper said that as well, but now according to François Chollet on the keras thread, the BN paper authors use BN after the activation layer. On the other hand, there are some benchmarks such as the one discussed on this torch-residual-networks github issue that show BN performing better after the activation layers.

My current opinion (open to being corrected) is that you should do BN after the activation layer, and if you have the budget for it and are trying to squeeze out extra accuracy, try before the activation layer.

So adding Batch Normalization to your CNN would look like this:

Conv1
Relu1
BatchNormalization
Pooling1
Conv2
Relu2
BatchNormalization
Pooling3
FullyConnect1
BatchNormalization
FullyConnect2
BatchNormalization