Q1. How do we initialize the values for Γ and β to find Ž(i)?
- If we initialize Γ = sqrt(σ² + ɛ) and β = μ, then Ž(i) = Z(i) on the first step of forward propagation. Γ and β would still be updated afterwards, but it doesn't make sense to me why we would initialize them this way.
- If we initialize Γ = 1 and β = 0, then Ž(i) = Z(i)_norm on the first step of forward propagation.
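To sanity-check the two initializations above, here is a minimal pure-Python sketch for a single hidden unit over a small mini-batch. The input values, the value chosen for ɛ, and the helper names `batch_stats` and `batch_norm` are all made up for illustration and do not come from any library.

```python
import math

EPS = 1e-8  # assumed value for the stabilizing constant ɛ


def batch_stats(z):
    """Mean μ and population variance σ² of one unit's Z(i) over the mini-batch."""
    m = len(z)
    mu = sum(z) / m
    var = sum((zi - mu) ** 2 for zi in z) / m
    return mu, var


def batch_norm(z, gamma, beta, eps=EPS):
    """Return (Z_norm, Ž), where Ž(i) = Γ * Z(i)_norm + β."""
    mu, var = batch_stats(z)
    z_norm = [(zi - mu) / math.sqrt(var + eps) for zi in z]
    z_tilde = [gamma * zn + beta for zn in z_norm]
    return z_norm, z_tilde


z = [0.5, -1.2, 3.0, 2.1]  # hypothetical pre-activations Z(i) for one unit
mu, var = batch_stats(z)

# Case 1: Γ = sqrt(σ² + ɛ), β = μ, so Ž(i) should recover Z(i)
_, z_tilde_identity = batch_norm(z, gamma=math.sqrt(var + EPS), beta=mu)

# Case 2: Γ = 1, β = 0, so Ž(i) should equal Z(i)_norm
z_norm, z_tilde_default = batch_norm(z, gamma=1.0, beta=0.0)

print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(z_tilde_identity, z)))
print(all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(z_tilde_default, z_norm)))
```

Both checks should come out true up to floating-point error, which matches the two bullet points: the first initialization undoes the normalization on the first forward pass, and the second leaves the normalized values unchanged.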
Q2. To update Γ and β, we take the derivatives of the cost function w.r.t. Γ and β and then update them, just as we update W and b. This means that Γ and β will be updated to minimize the cost function, which is not the purpose of updating those values. Our purpose was to speed up the convergence of gradient descent. How does updating Γ and β help gradient descent converge faster?
Q3. What are the dimensions of Γ and β if we have m examples rather than a single example?