Residual_Networks - BN - Channel Axis

I would appreciate help understanding the following:

  1. In the Identity and Convolutional Blocks, Batch Normalization is used, but it is not applied at the initial stage of the block; it only appears after each CONV2D layer. Why?

Thank you.

This is an interesting question, actually.

The reason we include BatchNorm is to reduce the “internal covariate shift” in layer outputs, which can appear in a deeper network or even across individual mini-batches. And, if we look at the results, it works effectively.
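Just to make the “channel axis” part of your title concrete: in training mode, BatchNorm normalizes each channel using statistics computed over the batch and spatial axes. Here is a minimal NumPy sketch (the tensor shape and epsilon are just illustrative, and I am ignoring the learned scale/shift parameters and the running averages):

```python
import numpy as np

# Toy activations with shape (batch, height, width, channels) = (4, 2, 2, 3).
x = np.random.randn(4, 2, 2, 3) * 5.0 + 10.0

# Per-channel mean/variance over the batch and spatial axes (0, 1, 2) for NHWC.
mean = x.mean(axis=(0, 1, 2), keepdims=True)
var = x.var(axis=(0, 1, 2), keepdims=True)
x_hat = (x - mean) / np.sqrt(var + 1e-3)

print(x_hat.mean(axis=(0, 1, 2)))  # roughly 0 for every channel
print(x_hat.std(axis=(0, 1, 2)))   # roughly 1 for every channel
```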

But the best location for BatchNorm is really case-by-case. Since covariate shift appears in deeper networks, we may start by inserting it just after a big operation like Conv2D. But covariate shift can also appear across different batches (mini-batches), so we may want to insert it in more places…
Then, test, look at the results, and tune.
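For example, a common pattern is to insert BatchNorm immediately after each Conv2D and before the activation. Here is a rough Keras-style sketch (the function name and parameters are just illustrative, not the exact code from the assignment):

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size=3, strides=1):
    """One stage: Conv2D, then BatchNorm on the channel axis, then ReLU."""
    x = layers.Conv2D(filters, kernel_size, strides=strides, padding="same")(x)
    x = layers.BatchNormalization(axis=3)(x)  # axis=3 -> per-channel statistics for NHWC
    x = layers.Activation("relu")(x)
    return x

# Example usage on a dummy input:
inputs = tf.keras.Input(shape=(64, 64, 3))
outputs = conv_bn_relu(inputs, filters=32)
```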
Even in the case of the residual network, the authors tried putting it at different locations. For example, a key decision is whether BatchNorm should be placed before merging the shortcut or after… In their tests, they got a good result when they put BatchNorm before merging the shortcut.
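As a rough sketch (again, just illustrative names, not the exact assignment code), a two-layer identity block with BatchNorm placed before the shortcut merge could look like this, assuming `filters` matches the number of input channels so that the Add works:

```python
from tensorflow.keras import layers

def identity_block_sketch(x, filters, kernel_size=3):
    """Identity-block sketch: BatchNorm is applied before merging the shortcut."""
    shortcut = x

    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization(axis=3)(x)
    x = layers.Activation("relu")(x)

    x = layers.Conv2D(filters, kernel_size, padding="same")(x)
    x = layers.BatchNormalization(axis=3)(x)  # BN sits before the shortcut merge

    x = layers.Add()([x, shortcut])           # merge the shortcut
    x = layers.Activation("relu")(x)          # final activation after the merge
    return x
```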
Like this, a lot of it comes down to trial and error.

And, a more annoying thing is that “why BatchNorm works” is still an active research topic. A recent paper, Understanding Batch Normalization, raised doubts about the original paper’s explanation based on internal covariate shift, and another recent work by different researchers, High-Performance Large-Scale Image Recognition Without Normalization, does not use BatchNorm at all. And the performance of this NFNet (Normalizer-Free Net) is better than recent EfficientNet/LambdaNet.

So, I should say there is no concrete guideline. Let’s try different placements and select the best one. :wink:

Hi Nobu,

Thanks very much for your quick and clear answer; it made me realize why I could not understand it in the first place! There is no single definitive reason, it is just trial and error!

Much appreciated.

Have a great weekend.

Best regards,
Antonio