I’d like to ask a few questions about the intuition behind why Batch Normalization works, as the theory presented in the lectures is a bit unintuitive to me.
It’s not clear to me why the distribution of the Z values would change during training in a more ‘controlled’ way when using Batch Norm, given that the β and γ parameters are also trained alongside the other weights (and we don’t know in advance what gradient descent will do to them). Why wouldn’t backprop simply undo the effect of batch norm through the β and γ parameters? To make the question concrete, below is a minimal sketch of how I picture the forward pass.
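Here’s a minimal NumPy sketch of my mental model (the function name, `eps`, and the shapes are just my own illustrative choices, not the course’s code). What strikes me is that the output’s per-feature mean and std are set directly by β and γ, two learned parameters, rather than by the full interaction of all upstream weights:

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-5):
    """Batch norm over a mini-batch z of shape (batch_size, n_features)."""
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_norm = (z - mu) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    # The output distribution is pinned to two learned parameters per feature
    # (beta = mean, gamma = std), whatever the upstream weights are doing.
    return gamma * z_norm + beta
```

So is the idea that the distribution is ‘controlled’ simply because it depends on these two parameters alone? And if so, what stops backprop from moving β and γ in a way that reintroduces the shift?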
When I think of the covariate-shift analogy with the image datasets (black vs. colored cats), couldn’t we argue that a shift in the feature distribution during training could actually favor better generalization, in the same way that training on both black and colored cats would?
Finally, it seems to me that a small batch size (e.g. 32 samples, typical in many applications) would introduce a significant amount of noise into the estimates of the mean/variance. Is there a reason we wouldn’t use an exponentially weighted average to estimate both of these when computing Z_norm during training, as we do during testing? My intuition is that by consistently using these converging values, training would better match the statistics used at test time (the mean/variance resulting from the EWA over all the mini-batch statistics). A sketch of what I have in mind follows.
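This is roughly the variant I’m imagining (again, the function names and the momentum value are just mine, hypothetical, for illustration): keep EWAs of the batch statistics and normalize with them during training as well, not just at test time.

```python
import numpy as np

def update_running_stats(running_mu, running_var, batch_mu, batch_var, momentum=0.9):
    # Exponentially weighted averages of the per-batch statistics,
    # the same quantities normally used only at test time.
    running_mu = momentum * running_mu + (1 - momentum) * batch_mu
    running_var = momentum * running_var + (1 - momentum) * batch_var
    return running_mu, running_var

def normalize_with_running_stats(z, running_mu, running_var, eps=1e-5):
    # Normalize with the smoothed statistics instead of the noisy per-batch ones.
    return (z - running_mu) / np.sqrt(running_var + eps)
```

Is there a downside to this that I’m missing?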
I’d be very interested to hear any of your thoughts!