Week 3 - How will batch normalization affect backpropagation?

Hello, I have a question on the week 3 material on batchnorm (note that I have not made it all the way through the lectures yet, but I have finished the batchnorm lectures).

When we do batchnorm, what will the backpropagation equations look like? Since, in my understanding, batch norm changes the relationship between A[l] and A[l+1], or between A[l] and W[l+1], won’t this be reflected in the derivatives? (i.e., the gamma parameter that multiplies the normalized value should show up.) I also don’t think the video explains how dgamma or dbeta for the batchnorm parameters are computed, which I believe is related to this question.

Is there a concise resource on this for quick reference?

(Or am I completely wrong and backpropagation stays unaffected? )

Please read this paper

Hi, @nvarma!

The main purpose of batch normalization is the standardization of the layer inputs, which basically means making each layer's input distribution have mean = 0 and std dev = 1. On top of that, the batch normalization layer learns the scale and shift parameters \gamma and \beta during training, so the network keeps its representational power.
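Written out for one mini-batch of size m (this is the notation from the paper, with \epsilon a small constant added for numerical stability), the forward transform is:

```latex
\mu_B      = \frac{1}{m}\sum_{i=1}^{m} x_i                      % mini-batch mean
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2          % mini-batch variance
\hat{x}_i  = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}   % normalize
y_i        = \gamma\,\hat{x}_i + \beta                          % scale and shift
```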

The backpropagation equations for batch normalization look like this, as shown in the paper @balaji.ambresh mentioned:

During training we need to backpropagate the gradient of loss ℓ through this transformation, as well as compute the gradients with respect to the parameters of the BN transform. We use chain rule, as follows (before simplification):
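In the paper's notation (\hat{x}_i is the normalized activation, \gamma and \beta the learned scale and shift, and the sums run over the m examples in the mini-batch), those gradients are:

```latex
\frac{\partial \ell}{\partial \hat{x}_i}  = \frac{\partial \ell}{\partial y_i} \cdot \gamma
\frac{\partial \ell}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \left(-\tfrac{1}{2}\right)\left(\sigma_B^2 + \epsilon\right)^{-3/2}
\frac{\partial \ell}{\partial \mu_B}      = \left(\sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}}\right) + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{\sum_{i=1}^{m} -2(x_i - \mu_B)}{m}
\frac{\partial \ell}{\partial x_i}        = \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{2(x_i - \mu_B)}{m} + \frac{\partial \ell}{\partial \mu_B} \cdot \frac{1}{m}
\frac{\partial \ell}{\partial \gamma}     = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i
\frac{\partial \ell}{\partial \beta}      = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}
```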

As a quick clarification to the original paper from 2015: Santurkar, Tsipras, Ilyas & Madry published this paper three years later, which examined the main reason behind the benefits of using BN. BN's benefit comes not from reducing internal covariate shift but from making the optimization landscape smoother, and that is what really makes the model train better.

Hi, @nvarma. I have moved this thread from Course 1 to Course 2, where it seems to belong. It should receive more attention here.

Hi everyone,

Just trying to simplify, for the newer learners here, what Balaji and Alva Roramajo explained earlier in response to this query. For those who aren’t familiar with the fancy term “internal covariate shift”, let me explain it.

So, where does this term actually come from? Batch normalization is added between the layers of the neural network: it takes the output from the previous layer and normalizes it before sending it to the next layer, right?


A phenomenon called internal covariate shift happens whenever the distribution of a layer's inputs changes as the earlier layers' parameters are updated. When this input distribution changes, the hidden layers have to keep adapting to the new distribution. This slows down the training process and wastes a lot of time, which is what we are always worried about while training the model :slight_smile:

In order to keep the distributions stable, we use the batch normalization technique to normalize the layer outputs to mean = 0, std = 1. With this technique the model trains faster, and accuracy often improves compared to the same model without batchnorm.

In practice, batchnorm is inserted between layers: most commonly it is applied to z[l], i.e. before a layer's activation function, although applying it after the activation is also used.

Now, to what @nvarma was actually asking: during this whole process, does batchnorm affect the derivatives too? Yes. The backpropagation step of batch normalization computes the derivatives of gamma (dgamma) and beta (dbeta). Gamma is used to scale the normalized value and beta is used to shift it up or down, which also eliminates the need for a separate bias term (see the sketch below).
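If it helps to see it in code, here is a minimal NumPy sketch of one batchnorm layer, assuming inputs of shape (m, d). The function names and the cache layout are just illustrative (not taken from the course notebooks), and dx uses the simplified expression you get after substituting the dmu and dvar terms:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a mini-batch x of shape (m, d), then scale and shift."""
    mu = x.mean(axis=0)                       # per-feature batch mean
    var = x.var(axis=0)                       # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # standardized activations
    out = gamma * x_hat + beta                # learned scale and shift
    cache = (x_hat, gamma, var, eps)
    return out, cache

def batchnorm_backward(dout, cache):
    """Given dL/dout, return dL/dx, dL/dgamma, dL/dbeta."""
    x_hat, gamma, var, eps = cache
    m = dout.shape[0]
    dgamma = np.sum(dout * x_hat, axis=0)     # dL/dgamma = sum(dL/dy * x_hat)
    dbeta = np.sum(dout, axis=0)              # dL/dbeta  = sum(dL/dy)
    dx_hat = dout * gamma
    # Simplified form of dL/dx after folding in the dmu and dvar terms:
    dx = (1.0 / m) / np.sqrt(var + eps) * (
        m * dx_hat
        - np.sum(dx_hat, axis=0)
        - x_hat * np.sum(dx_hat * x_hat, axis=0)
    )
    return dx, dgamma, dbeta

# Quick usage example on random data:
x = np.random.randn(4, 3)
gamma, beta = np.ones(3), np.zeros(3)
out, cache = batchnorm_forward(x, gamma, beta)
dx, dgamma, dbeta = batchnorm_backward(np.random.randn(4, 3), cache)
```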

The original paper derives these gradients with the chain rule (as quoted above). And yes, the paper mentioned by Alva Roramajo does argue that BN makes the optimization landscape significantly smoother without having much impact on internal covariate shift.
