Normalizing the input-layer features (A[0]) makes sense, because that just rescales the whole dataset by a fixed rule.

However, I’m struggling to understand how we can just “normalize” the hidden layer inputs without really breaking the backpropagation formulas.

What I mean is, previously we would calculate: dW[2] = 1/m * np.dot(dZ[2], A[1].T)

But that’s because: Z[2] = np.dot(W[2], A[1]) + b[2]

With batch normalization though, we have a new step in the computation graph, which looks much more complex:
mu = 1/m * np.sum(Z[2], axis=1, keepdims=True)
sigma2 = 1/m * np.sum((Z[2] - mu) ** 2, axis=1, keepdims=True)
Znorm[2] = (Z[2] - mu) / np.sqrt(sigma2 + epsilon)
Ztilda[2] = gamma * Znorm[2] + beta
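For concreteness, here is how I would write that forward step as a runnable function. The shapes are my own assumption (Z[2] is (n_units, m) with examples in columns, as in the course), and the function name is just mine:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, epsilon=1e-8):
    """Normalize Z across the mini-batch (columns), then scale and shift.

    Z: (n_units, m) pre-activations; gamma, beta: (n_units, 1) learnable params.
    """
    m = Z.shape[1]
    mu = np.sum(Z, axis=1, keepdims=True) / m                   # per-unit batch mean
    sigma2 = np.sum((Z - mu) ** 2, axis=1, keepdims=True) / m   # per-unit batch variance
    Znorm = (Z - mu) / np.sqrt(sigma2 + epsilon)                # zero mean, unit variance
    Ztilda = gamma * Znorm + beta                               # learnable scale and shift
    cache = (Znorm, sigma2, gamma, epsilon)                     # saved for backprop
    return Ztilda, cache
```

With gamma = 1 and beta = 0, each row of Ztilda comes out with (approximately) zero mean and unit variance over the batch.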

So previously, we had:
dL/dZ[2] = dL/dA[2] * dA[2]/dZ[2]

But now we need to have an intermediate step:
dL/dZ[2] = dL/dA[2] * dA[2]/dZtilda[2] * dZtilda[2]/dZ[2]

And it looks like dZtilda[2]/dZ[2] would be insanely complex to calculate explicitly.
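To make the worry concrete, here is a tiny numerical sketch (my own, with made-up numbers) showing that the naive guess dZtilda[2]/dZ[2] = gamma / np.sqrt(sigma2 + epsilon) is wrong, precisely because mu and sigma2 themselves depend on every entry of Z[2]:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(1, 4))          # one hidden unit, mini-batch of 4
gamma, eps, h = 1.5, 1e-8, 1e-5

def bn(Z):
    mu = Z.mean(axis=1, keepdims=True)
    sigma2 = Z.var(axis=1, keepdims=True)
    # beta omitted: an additive constant drops out of the derivative
    return gamma * (Z - mu) / np.sqrt(sigma2 + eps)

# Central-difference derivative of Ztilda[0,0] with respect to Z[0,0]
Zp, Zm = Z.copy(), Z.copy()
Zp[0, 0] += h
Zm[0, 0] -= h
d_true = (bn(Zp)[0, 0] - bn(Zm)[0, 0]) / (2 * h)

# Pretends mu and sigma2 are constants -- noticeably different from d_true
d_naive = gamma / np.sqrt(Z.var() + eps)
```

The exact diagonal entry works out to gamma * ((1 - 1/m) - Znorm**2 / m) / np.sqrt(sigma2 + eps), which only approaches the naive value as m grows.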

However, I don’t see this being mentioned in the lecture. Are we ignoring this side effect of batch normalization? If so, how do we even know that our gradient descent is ever going towards the minimum of the loss function?

Thus, BN transform is a differentiable transformation that introduces normalized activations into the network. This ensures that as the model is training, layers can continue learning on input distributions that exhibit less internal covariate shift, thus accelerating the training.

From a computational perspective, as long as it can be written as a formula, it is fine to differentiate. There will be no big difference in computational resource usage.

A recent topic related to Batch norm is actually removing it, approached from another angle. I do not want to confuse you, but here is the link to an interesting paper.

There is still a lot of work required in this area…

Thanks a lot for your response and for the link to the original paper! It makes sense that the function is still differentiable (why would it not be!). However, what I’m trying to understand is this:

If we were to implement batch gradient descent ourselves (without TensorFlow, as we did in previous exercises), would we have to adjust our formulas for back-propagation? Or would we keep them as they were, practically ignoring the extra step in the computational graph and hoping that it would somehow magically work?

Backward prop is really the reverse operation of forward prop. Otherwise, proper gradients cannot be delivered to update the parameters. In this sense, both forward prop and backward prop need to be rewritten to include the Batch norm equations.
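As a sketch of what "rewritten" means in practice (my own minimal implementation, not an official course solution), the chain rule through Znorm, mu and sigma2 collapses into a few lines. The function names and the cache layout are my own choices:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Forward pass: normalize each unit over the mini-batch, then scale/shift."""
    mu = np.mean(Z, axis=1, keepdims=True)
    sigma2 = np.var(Z, axis=1, keepdims=True)
    Znorm = (Z - mu) / np.sqrt(sigma2 + eps)
    Ztilda = gamma * Znorm + beta
    return Ztilda, (Znorm, sigma2, gamma, eps)

def batchnorm_backward(dZtilda, cache):
    """Backward pass: push dL/dZtilda back to dL/dZ, dL/dgamma, dL/dbeta."""
    Znorm, sigma2, gamma, eps = cache
    dgamma = np.sum(dZtilda * Znorm, axis=1, keepdims=True)
    dbeta = np.sum(dZtilda, axis=1, keepdims=True)
    dZnorm = dZtilda * gamma
    # Collapsed chain rule: the mean terms come from differentiating mu and sigma2
    dZ = (dZnorm
          - np.mean(dZnorm, axis=1, keepdims=True)
          - Znorm * np.mean(dZnorm * Znorm, axis=1, keepdims=True)) / np.sqrt(sigma2 + eps)
    return dZ, dgamma, dbeta
```

A numerical gradient check against a central difference confirms that dZ matches, so nothing has to work "magically": the extra node in the computation graph simply gets its own backward formula.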

(I’m leaving this community. If you have further questions, please create a new thread so that other mentors can easily find your topic and follow up.)