In the video, Prof Ng mentioned that Batch Norm is used to give the activations of the previous layer a standardized distribution. But we are learning beta and gamma with each iteration over the mini-batches, so the distribution will not be the same for different mini-batches. Can anyone help me resolve this conflict in my mind?
When we are in training mode, everything changes with each mini-batch iteration, right? Both the weight and bias values at every layer of the network and the \beta and \gamma values. In the inner layers of the network, the inputs are affected by the weights of the previous layers, and those are being learned (changed) with every iteration. But the learning for both the weights and the BN parameters is cumulative over the training: if your hyperparameter choices are good, then everything will converge, including both the usual parameters (W and b) and the BN parameters.
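To make the mechanics concrete, here is a minimal numpy sketch of the BN forward step for one layer on one mini-batch. The function name and shapes are my own illustration, not the course assignment code: the mean and variance are computed from the current mini-batch only (so they do differ from batch to batch), and \gamma and \beta are just learned parameters applied on top of the normalized values.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Sketch of Batch Norm for one layer on one mini-batch.

    Z     -- pre-activations for the mini-batch, shape (n_units, m)
    gamma -- learned scale parameter, shape (n_units, 1)
    beta  -- learned shift parameter, shape (n_units, 1)
    """
    # Statistics come from the current mini-batch only,
    # so they vary from one mini-batch to the next.
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)

    Z_norm = (Z - mu) / np.sqrt(var + eps)   # zero mean, unit variance per unit
    Z_tilde = gamma * Z_norm + beta          # learned scale and shift

    return Z_tilde
```

The key point is that \gamma and \beta are trained by gradient descent just like the weights, so the distribution the layer settles on is whatever the optimization finds useful, not literally mean 0 and variance 1.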
So to confirm my understanding: I should treat all the parameter adjustments as a whole, as if all the changes lead to the same standardized layer (for one layer during training) to minimize the effect on the later layers. Did I get it right?
Well, I don’t know that I would say that the goal is a “same standardized layer” or that you are trying to minimize the downstream effect. The point is that you want a model that works, right? Meaning that it makes accurate predictions. So whatever that requires in terms of what happens at the various layers is the goal. The point of Batch Normalization is that it makes it easier for the training to succeed by minimizing the covariate shift on the inputs to each layer of the network. The training learns the \beta and \gamma parameters that make that happen, as well as the W and b values that play the key role in what actually happens at each layer. In that sense I agree that we are training the whole ensemble here. The parts all interact and affect each other.
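If it helps, here is a hedged sketch of what “training the whole ensemble” looks like at the update step. The dictionary layout and names are my own for illustration, assuming backprop has already produced "dW", "db", "dgamma", "dbeta" for each layer: the BN parameters receive gradient steps in exactly the same way W and b do, every iteration.

```python
def update_parameters(params, grads, learning_rate):
    """Illustrative gradient descent step over all parameters.

    params[l] -- dict with "W", "b", "gamma", "beta" (numpy arrays) for layer l
    grads[l]  -- dict with "dW", "db", "dgamma", "dbeta" from backprop
    """
    for l in params:
        for name in ("W", "b", "gamma", "beta"):
            # The BN parameters are updated alongside the usual ones.
            params[l][name] -= learning_rate * grads[l]["d" + name]
    return params
```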
Then it is better to view it as functions performing what we need for the network to learn (like learning the parameters W and b for the layer computations, as well as learning beta and gamma for reducing covariate shift).