Course 2, Week 2: suggestion for Gradient Descent with Momentum

I suggest rewriting the formulas for v_dW and v_db so that they show more directly how the update acts as momentum for gradient descent.

The formula in the lesson:
v_dW := beta * v_dW + (1 - beta) * dW
W := W - alpha * v_dW

The rewritten formula:
v_dW := dW + beta * (v_dW - dW)
W := W - alpha * v_dW = W - [ alpha * dW + alpha * beta * (v_dW - dW) ]
(Inside the brackets, v_dW is its value from before the update.)

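Expanding dW + beta * (v_dW - dW) gives back beta * v_dW + (1 - beta) * dW, so the two forms are identical. Written this way, we can see that the term beta * (v_dW - dW) is the momentum for gradient descent.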
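As a quick sanity check that the two forms really are the same update, here is a minimal sketch in plain Python. The function names and the random stand-in gradients are made up for illustration and are not from the course materials; beta = 0.9 is just an example value.

```python
import random

def momentum_lesson(v_dW, dW, beta):
    """Update as written in the lesson: an exponentially weighted average."""
    return beta * v_dW + (1 - beta) * dW

def momentum_rewritten(v_dW, dW, beta):
    """Rewritten update: the current gradient plus a correction term."""
    return dW + beta * (v_dW - dW)

random.seed(0)
beta = 0.9            # example value, an assumption for this check
v_lesson = 0.0
v_rewritten = 0.0
max_diff = 0.0
for _ in range(1000):
    dW = random.uniform(-1.0, 1.0)  # stand-in gradient, not real training data
    v_lesson = momentum_lesson(v_lesson, dW, beta)
    v_rewritten = momentum_rewritten(v_rewritten, dW, beta)
    max_diff = max(max_diff, abs(v_lesson - v_rewritten))

# Any difference should be at the level of floating-point rounding only.
print("largest difference between the two forms:", max_diff)
```

The two running values should agree up to rounding error, which is what you would expect since one formula is just the other with the terms regrouped.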

If v_dW < dW: the term is negative, so it acts as a negative acceleration that limits the oscillation
If v_dW > dW: the term is positive, so it acts as a positive acceleration that speeds up convergence
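To make the two cases concrete, here is a small sketch for the scalar case only (for a weight matrix the comparison would be element by element). The numbers, the helper name correction_term, and the values of alpha and beta are all made up for illustration. It compares the momentum step alpha * v_dW with the plain gradient descent step alpha * dW.

```python
def correction_term(v_dW_prev, dW, beta):
    """The beta * (v_dW - dW) term, using v_dW from before the update."""
    return beta * (v_dW_prev - dW)

alpha, beta = 0.1, 0.9  # illustrative values, assumptions for this sketch

# (previous v_dW, current dW) pairs, chosen by hand for the two cases
cases = {
    "v_dW < dW": (0.5, 1.0),
    "v_dW > dW": (1.0, 0.5),
}

results = {}
for label, (v_dW_prev, dW) in cases.items():
    c = correction_term(v_dW_prev, dW, beta)
    momentum_step = alpha * (dW + c)  # alpha * new v_dW
    plain_step = alpha * dW           # step without momentum
    results[label] = (c, momentum_step, plain_step)
    print(label, "correction:", c, "momentum step:", momentum_step, "plain step:", plain_step)
```

In the first case the correction is negative and the momentum step comes out smaller than the plain step; in the second it is positive and the momentum step comes out larger.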

That’s an interesting point! Of course, they wrote it the way they did to emphasize that the update is effectively an exponentially weighted average of dW. It would be nice to add your formulation to the lectures or to the explanations in the assignment, to give more intuition about why momentum is useful and how it achieves that effect. Thanks for pointing this out!
