Batch normalization gradient computation question

Normalizing the input layer features (A[0]) makes sense, because that's just rescaling and shifting the whole dataset according to a fixed rule.

However, I’m struggling to understand how we can just “normalize” the hidden layer inputs without really breaking the backpropagation formulas.

What I mean is, previously we would calculate: dW[2] = 1/m * np.dot(dZ[2], A[1].T)

But that’s because: Z[2] = np.dot(W[2], A[1]) + b[2]

With batch normalization though, we have a new step in the computation graph, which looks much more complex:
mu = 1/m * np.sum(Z[2])
sigma2 = 1/m * np.sum((Z[2] - mu) ** 2)
Znorm[2] = (Z[2] - mu) / np.sqrt(sigma2 + epsilon)
Ztilda[2] = gamma * Znorm[2] + beta
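
As a concrete sketch, here is the same step written as runnable numpy code, assuming Z[2] is stored as an (n2, m) matrix with examples in columns; the helper name batchnorm_forward, the default epsilon, and the shape conventions are my assumptions, not from the lecture:

import numpy as np

def batchnorm_forward(Z, gamma, beta, epsilon=1e-8):
    # Z has shape (n_units, m); statistics are computed per unit, over the batch
    mu = np.mean(Z, axis=1, keepdims=True)
    sigma2 = np.var(Z, axis=1, keepdims=True)
    Znorm = (Z - mu) / np.sqrt(sigma2 + epsilon)
    Ztilda = gamma * Znorm + beta                   # gamma, beta have shape (n_units, 1)
    cache = (Z, Znorm, mu, sigma2, gamma, epsilon)  # saved for the backward pass
    return Ztilda, cache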

So previously, we had:
dL/dZ[2] = dL/dA[2] * dA[2]/dZ[2]

But now we need to have an intermediate step:
dL/dZ[2] = dL/dA[2] * dA[2]/dZtilda[2] * dZtilda[2]/dZ[2]

And it looks like dZtilda[2]/dZ[2] would be insanely complex to calculate explicitly.

However, I don’t see this being mentioned in the lecture. Are we ignoring this side effect of batch normalization? If so, how do we even know that our gradient descent is ever going towards the minimum of the loss function?

It is a good point, and it is worth checking that the transformation is differentiable; otherwise it would put more of a burden on the computation, as you suspected.

If we look at the original paper, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, the authors of course verified this.

Here is their work.
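
Paraphrasing from memory, the chain-rule derivation in Section 3 of the paper looks like this (their notation: $x_i$ are the mini-batch inputs to the BN transform, $\hat{x}_i$ the normalized values, $y_i = \gamma \hat{x}_i + \beta$ the outputs, and $\mu_B$, $\sigma_B^2$ the batch mean and variance):

$$\frac{\partial \ell}{\partial \hat{x}_i} = \frac{\partial \ell}{\partial y_i} \cdot \gamma$$

$$\frac{\partial \ell}{\partial \sigma_B^2} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot (x_i - \mu_B) \cdot \frac{-1}{2} (\sigma_B^2 + \epsilon)^{-3/2}$$

$$\frac{\partial \ell}{\partial \mu_B} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{-1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{\sum_{i=1}^{m} -2 (x_i - \mu_B)}{m}$$

$$\frac{\partial \ell}{\partial x_i} = \frac{\partial \ell}{\partial \hat{x}_i} \cdot \frac{1}{\sqrt{\sigma_B^2 + \epsilon}} + \frac{\partial \ell}{\partial \sigma_B^2} \cdot \frac{2 (x_i - \mu_B)}{m} + \frac{\partial \ell}{\partial \mu_B} \cdot \frac{1}{m}$$

$$\frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i} \cdot \hat{x}_i, \qquad \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^{m} \frac{\partial \ell}{\partial y_i}$$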

Thus, BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training.

From the computer's perspective :slight_smile: , as long as it can be written as a formula, it is fine. There is no big difference in computational resource usage.

A recent topic related to Batch norm is actually removing it, approached from a different angle. I do not want to confuse you, but here is the link to an interesting paper.

There is still a lot of work required in this area… :wink:

Hi @anon57530071 ,

Thanks a lot for your response and for the link to the original paper! It makes sense that the function is still differentiable (why would it not be!). However, what I’m trying to understand is this:

If we were to implement batch gradient descent ourselves (without TensorFlow, as we did in previous exercises), would we have to adjust our back-propagation formulas, or would we just keep them as they were, practically ignoring the extra step in the computational graph and hoping that it would somehow magically work?

Backward prop really is the reverse of forward prop; otherwise, the proper gradients cannot be delivered to update the parameters. In this sense, both forward prop and backward prop need to be rewritten to include the Batch norm equations.
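
For example, staying with the layer-2 notation from this thread, and assuming Z2, Znorm2, mu, sigma2, gamma and epsilon were cached during the forward pass (the flat variable names are mine), the extra backward step is roughly this sketch:

# dZtilda2 = dL/dZtilda[2], obtained from dA[2] exactly as before, via the activation
dgamma = np.sum(dZtilda2 * Znorm2, axis=1, keepdims=True)
dbeta = np.sum(dZtilda2, axis=1, keepdims=True)
dZnorm2 = dZtilda2 * gamma
dsigma2 = np.sum(dZnorm2 * (Z2 - mu) * -0.5 * (sigma2 + epsilon) ** (-1.5), axis=1, keepdims=True)
dmu = np.sum(-dZnorm2 / np.sqrt(sigma2 + epsilon), axis=1, keepdims=True) + dsigma2 * np.mean(-2.0 * (Z2 - mu), axis=1, keepdims=True)
dZ2 = dZnorm2 / np.sqrt(sigma2 + epsilon) + dsigma2 * 2.0 * (Z2 - mu) / m + dmu / m
# from here, dW[2] = 1/m * np.dot(dZ2, A[1].T) holds exactly as before

gamma and beta then get their own gradient-descent updates from dgamma and dbeta, just like W and b.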

(I’m leaving this community. If you have further questions, please create a new thread so that other mentors can easily find your topic and follow up.)