Hi, I’m curious why the equation for gradient descent with momentum here doesn’t rely on the previous gradient. As I remember, the exponentially weighted moving average is defined as v_t = \beta v_{t-1} + (1-\beta)\theta_t: we connect v_t to its previous values so that the resulting curve is smoother.
But in the case of gradient descent with momentum, why don’t we involve v_{dW^{[l-1]}}?
Also, since we initialize every v_{dW^{[l]}} to zero, what’s the point of multiplying \beta by zero here?
The update rule is applied to each layer separately, since each layer has its own parameters. The term v_{t−1} therefore represents the velocities computed at the previous time step, i.e., the previous iteration, for the same layer l, not the velocities from the previous layer l-1.
Multiplying by \beta starts to matter after the first iteration, because the (1 - \beta)dW^{[l]} and (1 - \beta)db^{[l]} terms being added make the velocities nonzero.
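To make this concrete, here is a minimal sketch of a momentum update, assuming the dictionary naming conventions commonly used in the course assignments (`parameters["W1"]`, `grads["dW1"]`, `v["dW1"]`, ...); the internals shown here are an illustration, not the exact assignment code.

```python
import numpy as np

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    """One momentum update, applied per layer within one iteration.

    v holds the velocities from the PREVIOUS ITERATION for each layer l
    (not the velocities from layer l-1).
    """
    L = len(parameters) // 2  # number of layers
    for l in range(1, L + 1):
        # v_t = beta * v_{t-1} + (1 - beta) * gradient, for this same layer l
        v["dW" + str(l)] = beta * v["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]
        v["db" + str(l)] = beta * v["db" + str(l)] + (1 - beta) * grads["db" + str(l)]
        # Parameter step uses the velocity, not the raw gradient
        parameters["W" + str(l)] -= learning_rate * v["dW" + str(l)]
        parameters["b" + str(l)] -= learning_rate * v["db" + str(l)]
    return parameters, v

# Tiny demonstration: one layer, constant gradient of 1, beta = 0.9.
parameters = {"W1": np.zeros((2, 2)), "b1": np.zeros((2, 1))}
grads = {"dW1": np.ones((2, 2)), "db1": np.ones((2, 1))}
v = {"dW1": np.zeros((2, 2)), "db1": np.zeros((2, 1))}  # initialized to zero
beta, lr = 0.9, 0.1

# Iteration 1: beta * v contributes nothing (v is still zero),
# so v becomes (1 - beta) * dW = 0.1.
parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, lr)
# Iteration 2: v is no longer zero, so multiplying by beta now matters:
# v = 0.9 * 0.1 + 0.1 * 1 = 0.19
parameters, v = update_parameters_with_momentum(parameters, grads, v, beta, lr)
```

The same `v` dictionary is passed back in each iteration, which is exactly why the "previous" velocities in the formula refer to the previous iteration of the same layer.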
Hi @Well_Zhang,
In addition to @Mujassim_Jamal’s explanation especially on the meaning of t and [l],
The previous gradients are already there: you update them with dW^{[l]} or db^{[l]}, and they become the “current” gradients.
Cheers,
Raymond
Thank you. Since update_parameters_with_momentum() is called in each iteration, the old v_{dW^{[l]}} here actually comes from the previous iteration, and we update it in each new iteration.