C2 W3 Normalizing Activations in a Network

Q1. How do we initialize the values for Γ and β to find Ž(i)?

  1. If we initialize Γ = sqrt(σ² + ɛ) and β = μ, then Ž(i) = Z(i) on the first step of forward propagation. We will still be updating the parameters Γ and β afterwards, but it doesn’t make sense to me why we would choose this initialization in the first place.

  2. If we initialize Γ = 1 and β = 0, then Ž(i) = Z(i)_norm on the first step of forward propagation.
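A small NumPy sketch of the two initializations (a hypothetical one-layer example, not from the course notebooks) shows why each behaves as described: with Γ = sqrt(σ² + ɛ) and β = μ the transform exactly undoes the normalization, while Γ = 1 and β = 0 leaves the normalized values unchanged.

```python
import numpy as np

# Batch-norm transform for one layer: Z_tilde = gamma * Z_norm + beta.
# Z has shape (n_units, m_examples); gamma and beta have shape (n_units, 1).
eps = 1e-8
Z = np.random.randn(4, 16)

mu = Z.mean(axis=1, keepdims=True)
var = Z.var(axis=1, keepdims=True)
Z_norm = (Z - mu) / np.sqrt(var + eps)

# Case 1: gamma = sqrt(var + eps), beta = mu  ->  Z_tilde recovers Z exactly.
gamma, beta = np.sqrt(var + eps), mu
Z_tilde = gamma * Z_norm + beta
assert np.allclose(Z_tilde, Z)

# Case 2: gamma = 1, beta = 0  ->  Z_tilde equals Z_norm.
gamma, beta = np.ones((4, 1)), np.zeros((4, 1))
assert np.allclose(gamma * Z_norm + beta, Z_norm)
```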

Q2. To update the values of Γ and β, we take the derivatives of the Cost Function w.r.t. Γ and β and then update their values, just like we update the values of W and b. This means that Γ and β will be updated to minimize the Cost Function, which is not the stated purpose of updating those values. Our purpose for updating them was to speed up the convergence of gradient descent. How does updating the values of Γ and β help gradient descent converge faster?

Q3. What will be the dimensions of Γ and β if we have m examples rather than 1 example?

That is not how the training of the Batch Norm parameters is done. Here’s the Keras documentation page about that. The scale and shift parameters Γ and β are trainable weights, but the batch statistics are not learned by computing gradients w.r.t. the Cost Function: the layer computes the mean and standard deviation of the batches it sees during training, maintains moving averages of them, and then uses those saved values during inference. There are also parameters you can use to control how the moving averages are updated. Please see the doc page for details.
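A minimal NumPy sketch of the moving-average mechanism described above (this is an illustration of the idea, not the actual Keras implementation; the momentum value 0.99 is assumed, mirroring the Keras default):

```python
import numpy as np

np.random.seed(0)

# During training, the layer updates exponential moving averages of the
# batch statistics; gamma and beta would be updated by gradients separately.
momentum, eps = 0.99, 1e-3
moving_mean, moving_var = 0.0, 1.0

for _ in range(500):
    batch = np.random.randn(32) * 2.0 + 5.0  # batches with mean ~5, std ~2
    moving_mean = momentum * moving_mean + (1 - momentum) * batch.mean()
    moving_var = momentum * moving_var + (1 - momentum) * batch.var()

# At inference, the layer normalizes with the saved statistics instead of
# the current batch's statistics:
x = np.random.randn(32) * 2.0 + 5.0
x_hat = (x - moving_mean) / np.sqrt(moving_var + eps)
```

After enough training batches, `moving_mean` and `moving_var` converge toward the population statistics, which is what makes them usable at inference time when there may be no batch to average over.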

The shapes of those parameters are independent of the batch size. It just uses broadcasting to do the elementwise operations across the “samples” dimension.
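To make the broadcasting point concrete, here is a small NumPy sketch (with assumed example shapes): Γ and β stay shape (n, 1) no matter the batch size m, and the elementwise operations broadcast across the samples dimension.

```python
import numpy as np

# gamma and beta have one value per unit, independent of the batch size m.
n, m = 4, 8
Z_norm = np.random.randn(n, m)
gamma = np.full((n, 1), 2.0)
beta = np.full((n, 1), 0.5)

# (n, 1) broadcasts against (n, m): each row's gamma/beta is applied to
# every example in that row.
Z_tilde = gamma * Z_norm + beta
assert Z_tilde.shape == (n, m)
```

The same parameters work unchanged whether m is 1 or 1024; only the normalized activations change shape.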