Hi, I’m curious why the equation for gradient descent with momentum here doesn’t seem to rely on the previous gradient. I remember that the exponentially weighted moving average is defined as v_t = \beta * v_{t-1} + (1-\beta)*\theta_t. Because v_t is tied to its previous values, we get a smoother curve.

But in the case of gradient descent with momentum, why don’t we involve v_{dW^{[l-1]}}?

Also, we initialized all v_{dW^{[l]}} with zeros, so what’s the point of multiplying \beta by zero here?

The update rule is applied to each layer separately, because each layer has its own parameters. So the v_{dW^{[l]}} on the right-hand side plays the role of v_{t-1}: it is the velocity calculated for layer l at the previous time step (the previous iteration), not the velocity of the previous layer l-1.

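On the first iteration, \beta times the zero-initialized velocity is indeed zero. From the second iteration onward, multiplying by \beta makes sense, because (1 - \beta)dW^{[l]} and (1 - \beta)db^{[l]} have been added and the velocities are no longer zero.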
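
To make the two indices concrete, here is a minimal sketch in Python. The function name `momentum_step`, the two toy "layers" and the constant gradients are my own assumptions for illustration, not the course's code. The point is that each layer keeps its own velocity, and that velocity is carried over from one iteration to the next.

```python
import numpy as np

def momentum_step(params, grads, v, beta=0.9, learning_rate=0.01):
    """Apply one momentum update to every layer (hypothetical helper)."""
    for key in params:
        # v[key] on the right-hand side is this layer's velocity from the
        # PREVIOUS iteration, not the velocity of the previous layer.
        v[key] = beta * v[key] + (1 - beta) * grads[key]
        params[key] = params[key] - learning_rate * v[key]
    return params, v

# Two toy "layers", each with its own parameter and its own velocity.
params = {"W1": np.array([1.0]), "W2": np.array([2.0])}
v = {key: np.zeros_like(value) for key, value in params.items()}

history = []
for t in range(3):
    # Constant toy gradients so the effect of beta is easy to follow.
    grads = {"W1": np.array([1.0]), "W2": np.array([-2.0])}
    params, v = momentum_step(params, grads, v)
    history.append(v["W1"][0])

print(history)  # velocity of W1 after each of the three iterations
```

With these toy values, the velocity of `W1` is 0.1 after the first iteration (where \beta multiplies zero) and keeps growing afterwards, because \beta now multiplies a non-zero value from the previous iteration.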

Hi @Well_Zhang,

In addition to @Mujassim_Jamal’s explanation especially on the meaning of t and [l],

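The previous gradients are already there: they are stored in the v_{dW^{[l]}} and v_{db^{[l]}} on the right-hand side of the update. You then update them with the current dW^{[l]} or db^{[l]}, and the result becomes the “current” velocity.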
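
One way to see this is to unroll the update for a single layer l. Here g^{(t)} is my own shorthand for dW^{[l]} at iteration t, and v^{(t)} for v_{dW^{[l]}} after iteration t, with v^{(0)} = 0:

```latex
\[
\begin{aligned}
v^{(1)} &= \beta \cdot 0 + (1-\beta)\, g^{(1)} = (1-\beta)\, g^{(1)} \\
v^{(2)} &= \beta\, v^{(1)} + (1-\beta)\, g^{(2)}
         = \beta (1-\beta)\, g^{(1)} + (1-\beta)\, g^{(2)} \\
v^{(t)} &= (1-\beta) \sum_{k=1}^{t} \beta^{\,t-k}\, g^{(k)}
\end{aligned}
\]
```

So every earlier gradient of layer l is still present in the velocity, only weighted down by a power of \beta.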


Thank you. Since we call update_parameters_with_momentum() in each iteration, the old v_{dW^{[l]}} here is actually the one from the last iteration, and we update it in each new iteration.

Yes, you are right …
