I would appreciate help understanding the following:
- In the Identity and Convolutional Blocks, Batch Normalization is used, but it is not applied at the initial stage of the block; it is applied only after CONV2D. Why?
Thank you.
This is an interesting question, actually.
The reason we include BatchNorm is to reduce the "internal covariate shift" of layer outputs, which can appear in deeper networks or simply across different mini-batches. And, looking at the results, it works effectively.
However, where to put BatchNorm is really a case-by-case decision. Since covariate shift can appear deeper in the network, we might start by inserting it right after a heavy operation like Conv2D. But covariate shift can also appear across different mini-batches, so we might want to insert it in more places…
Then we test, look at the results, and tune.
Even in the case of the residual network, the authors tried placing it at different locations. For example, a key decision is whether BatchNorm should be applied before merging the shortcut or after… In their experiments, they got better results when they put BatchNorm before merging the shortcut.
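To make that placement concrete, here is a minimal sketch of an identity block in Keras. This is my own illustrative version (layer sizes and names are assumptions, not the exact assignment code): BatchNorm sits directly after each Conv2D, and the last BatchNorm is applied before the shortcut is merged back in.

```python
import tensorflow as tf
from tensorflow.keras import layers

def identity_block(x, filters, kernel_size=3):
    # Assumes the input already has `filters` channels,
    # so the shortcut can be added back without a projection.
    shortcut = x

    # Conv2D -> BatchNorm -> ReLU: BatchNorm normalizes the conv output
    y = layers.Conv2D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)

    # Second conv; BatchNorm again right after Conv2D,
    # i.e. BEFORE the shortcut is merged back in
    y = layers.Conv2D(filters, kernel_size, padding="same")(y)
    y = layers.BatchNormalization()(y)

    # Merge the shortcut, then apply the final activation
    y = layers.Add()([y, shortcut])
    return layers.Activation("relu")(y)
```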
As you can see, it is all trial-and-error oriented.
And the even more annoying thing is that "why BatchNorm works" is still an open research topic. A recent paper, Understanding Batch Normalization, raised doubts about the original paper's internal covariate shift explanation, and another recent work by different researchers, High-Performance Large-Scale Image Recognition Without Normalization, does not use BatchNorm at all. The resulting NFNet (Normalizer-Free Net) performs better than the recent EfficientNet/LambdaNet.
So I should say there is no concrete guideline. Try the options and select the one that works best.
Hi Nobu,
Thanks very much for your quick and clear answer, which made me understand why I did not understand it in the first place! There is no single reason why, just trial and error!
Much appreciated.
Have a great weekend.
Best regards,
Antonio