Batch Norm and Covariate Shift

In the video, Prof. Ng mentioned that Batch Norm is used to give the activations of the previous layer a standard distribution, but we are learning beta and gamma with each mini-batch iteration, so the distribution will not be the same for different mini-batches. Can anyone help me resolve this conflict in my mind?

When we are in training mode, everything changes with each mini-batch iteration, right? That includes both the weight and bias values at every layer of the network and the \beta and \gamma values. In the inner layers of the network, the inputs are affected by the weights of the previous layers, and those are being learned (changed) with every iteration. But the learning for both the weights and the BN parameters is cumulative over the training: if your hyperparameter choices are good, then everything will converge, including both the usual parameters (W and b) and the BN parameters.
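To make that concrete, here is a minimal NumPy sketch of the BN step for one layer (names like `batchnorm_forward`, `z`, `gamma`, `beta` are just illustrative, not from the course notebooks). The mean and variance are recomputed from each mini-batch, while \gamma and \beta are learned parameters that persist and accumulate updates across mini-batches:

```python
import numpy as np

def batchnorm_forward(z, gamma, beta, eps=1e-8):
    # mu and var come from the current mini-batch only,
    # so the exact statistics differ slightly from batch to batch
    mu = z.mean(axis=0)
    var = z.var(axis=0)
    z_norm = (z - mu) / np.sqrt(var + eps)
    # gamma and beta are learned parameters (trained like W and b);
    # they set the scale and shift of the normalized pre-activations
    return gamma * z_norm + beta

# Two different mini-batches for a layer with 3 hidden units
rng = np.random.default_rng(0)
z_batch1 = rng.normal(loc=2.0, scale=1.5, size=(4, 3))
z_batch2 = rng.normal(loc=2.3, scale=1.1, size=(4, 3))

gamma = np.ones(3)   # the same learned values apply to every batch
beta = np.zeros(3)

print(batchnorm_forward(z_batch1, gamma, beta).mean(axis=0))  # ~0 per unit
print(batchnorm_forward(z_batch2, gamma, beta).mean(axis=0))  # ~0 per unit
```

So each mini-batch is normalized with its own statistics, but all of them are pulled toward the scale and shift that the shared \gamma and \beta define.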

So to confirm my understanding: I should treat all the parameter adjustments as a whole, as if all the changes lead to the same standardized distribution for a given layer during training, in order to minimize the effect on the later layers. Did I get it right?

Well, I don’t know that I would say the goal is a “same standardized layer” or that you are trying to minimize the downstream effect. The point is that you want a model that works, right? Meaning that it makes accurate predictions. So whatever that requires in terms of what happens at the various layers is the goal. The point of Batch Normalization is that it makes it easier for the training to succeed by minimizing the covariate shift in the inputs to each layer of the network. The training learns the \beta and \gamma parameters that make that happen, as well as the W and b values that play the key role in what actually happens at each layer. In that sense, I agree that we are training the whole ensemble here: the parts all interact and affect each other.
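One way to picture “training the whole ensemble” is that, after backprop on a mini-batch, W, b, \gamma, and \beta all get nudged by the same kind of gradient step. This is a hypothetical sketch with made-up gradient values, not code from the assignments:

```python
import numpy as np

learning_rate = 0.01

# One layer's parameters: the usual W and b plus the BN scale/shift
params = {
    "W": np.array([[0.5, -0.3], [0.1, 0.8]]),
    "b": np.array([0.0, 0.0]),
    "gamma": np.array([1.0, 1.0]),
    "beta": np.array([0.0, 0.0]),
}

# Placeholder gradients standing in for what backprop would produce
grads = {
    "W": np.array([[0.02, -0.01], [0.03, 0.00]]),
    "b": np.array([0.01, -0.02]),
    "gamma": np.array([0.05, -0.02]),
    "beta": np.array([-0.01, 0.04]),
}

# Every parameter, including gamma and beta, takes a small step on
# every iteration, so the learning is cumulative across mini-batches
for name in params:
    params[name] -= learning_rate * grads[name]
```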

Then it is better to view them as functions performing what we need the network to learn: learning the parameters W and b for the layer computations, as well as learning beta and gamma to reduce covariate shift.
