Batch Normalization Intuition

Greetings everyone,

I’d like to ask a few questions about the intuition behind why Batch Normalization works, as the theory presented in the classes is a bit unintuitive to me.

  1. It’s not clear to me why the distribution of the Z values would change in a more ‘controlled’ way during training when using Batch Norm, given that the β and γ parameters are trained alongside the rest of the network (and we don’t know in advance how gradient descent will affect them). Why wouldn’t backprop simply cancel out the effect of Batch Norm through the β and γ parameters?

  2. When I think of the covariate-shift analogy with image datasets (black vs. colored cats), couldn’t we argue that a shift in the distribution of the features during training could actually favor better generalization, in the same way that training on both black and colored cats does?

  3. Finally, it seems to me that a small batch size (e.g. the 32 samples typically used in many applications) would introduce a significant amount of noise into the estimates of the mean and variance. Is there a reason we don’t use an exponentially weighted average (EWA) of these statistics to compute Znorm during training, as we do during testing? My intuition is that by systematically using these converging values, training would better match the statistics used at test time (the mean and variance obtained from the EWA over all the samples).
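To make the concern in question 3 concrete, here is a minimal NumPy sketch (the momentum value and the distribution parameters are illustrative assumptions, not anything prescribed in the course): it compares the noisy per-batch mean estimates at batch size 32 against the exponentially weighted average that frameworks typically track for test time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations for one unit: true mean 5.0, true std 2.0.
stream = rng.normal(loc=5.0, scale=2.0, size=(1000, 32))  # 1000 mini-batches of 32

momentum = 0.9          # assumed EWA momentum; frameworks vary (often 0.9-0.99)
running_mean = 0.0
batch_means = []
for batch in stream:
    mu = batch.mean()   # noisy per-batch estimate, std ~ 2/sqrt(32) ~ 0.35
    batch_means.append(mu)
    # EWA update, as used to accumulate test-time statistics:
    running_mean = momentum * running_mean + (1 - momentum) * mu

print(np.std(batch_means))   # spread of the per-batch estimates (noticeable noise)
print(running_mean)          # EWA after many batches: close to the true mean 5.0
```

The per-batch estimates really do jitter by roughly 2/√32 ≈ 0.35 around the true mean, while the EWA smooths that noise out; the question is whether that smoothing would also help during training itself.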

I’d be very interested to hear any of your thoughts!

Best regards,

Hi EscapisGR,

Welcome to DeepLearning.AI!!!
My thoughts on Batch Norm: in a network with many layers, there is an internal covariate shift at the input of each layer, and this can push the scale of the signals in the network out of an appropriate range. Since the output of one layer feeds the input of the next, we want the signals to stay in a normalized range, and that is where Batch Norm comes to the rescue. It helps keep the pre-activations balanced, which mitigates vanishing or exploding gradients and lets us train deeper and deeper networks. To my knowledge, Batch Norm also has a mild regularization effect. That’s why we use Batch Norm to stabilize our networks.
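To illustrate the "balancing the pre-activations" point, here is a minimal sketch of a batch-norm forward pass in training mode (the γ and β values are made-up illustrative numbers, not learned ones): two batches with wildly different raw distributions come out with the same controlled mean and spread, set by β and γ rather than by whatever scale the previous layer happened to produce.

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Minimal batch-norm forward pass (training mode), per-feature statistics."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_norm = (z - mu) / np.sqrt(var + eps)   # zero mean, unit variance
    return gamma * z_norm + beta             # learned scale and shift

rng = np.random.default_rng(1)
gamma, beta = 1.5, 0.5                       # illustrative "learned" parameters

# Two batches whose raw distributions differ wildly (an internal covariate shift):
z_small = rng.normal(0.0, 1.0, size=(64, 1))
z_large = rng.normal(50.0, 10.0, size=(64, 1))

out_small = batch_norm_forward(z_small, gamma, beta)
out_large = batch_norm_forward(z_large, gamma, beta)

# After batch norm, both batches land in the same controlled range:
print(out_small.mean(), out_small.std())  # mean ~ beta, std ~ gamma
print(out_large.mean(), out_large.std())  # mean ~ beta, std ~ gamma
```

This also hints at an answer to question 1: backprop can move β and γ, but the distribution the next layer sees is always parameterized by just those two numbers, instead of depending on the full coupling of all the earlier layers' weights.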
For more information on Batch Normalization, you can check out the URL below.

Best regards,