Question Regarding Scaling of V_(dw) and V_(db)

Course video: Week #2 (Link: https://www.coursera.org/learn/deep-neural-network/lecture/y0m1f/gradient-descent-with-momentum)

Hello, I would like some clarification regarding the omission of (1-B) when implementing gradient descent with momentum. Using V_(dw) as an example, the original equation is V_(dw) = B*V_(dw) + (1-B)*dw. Dr. Ng says that when implementing it, we can omit the (1-B), so the equation becomes V_(dw) = B*V_(dw) + dw. What I don't understand is his statement that this leads to V_(dw) being scaled by a factor of 1/(1-B). How does omitting (1-B) lead to such scaling?
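
For concreteness, the two updates I am comparing look roughly like this (a toy NumPy sketch with made-up variable names, not the code from the programming assignment):

```python
import numpy as np

beta = 0.9
dW = np.random.randn(3, 2)        # stand-in for the gradient of W
v_dW_avg = np.zeros_like(dW)      # for the version with (1 - beta)
v_dW_noavg = np.zeros_like(dW)    # for the version without it

# Original equation from the lecture: exponentially weighted average
v_dW_avg = beta * v_dW_avg + (1 - beta) * dW

# Simplified version with (1 - beta) omitted
v_dW_noavg = beta * v_dW_noavg + dW
```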

I would greatly appreciate any clarifications regarding this. Thank you!

Hello @Tommy_Lee,

With the (1-B) term, V_(dw) = B*V_(dw) + (1-B)*dw is an exponentially weighted average of the gradients. Unrolling the recursion gives V_(dw) = (1-B)*(dw_t + B*dw_(t-1) + B^2*dw_(t-2) + ...), and the weights (1-B)*(1 + B + B^2 + ...) sum to 1, so if the gradients were roughly constant, V_(dw) would settle near dw.

If you drop the (1-B), the unrolled sum becomes V_(dw) = dw_t + B*dw_(t-1) + B^2*dw_(t-2) + ..., and the weights now sum to 1 + B + B^2 + ... = 1/(1-B). For the same gradients, V_(dw) therefore ends up about 1/(1-B) times larger than before. Nothing is wrong with that version, but because the parameter update is W := W - alpha*V_(dw), the steps also become 1/(1-B) times larger, so the best value of alpha shifts: you would scale alpha down by (1-B), and you would have to retune it whenever you change B.

Cheers,
Raymond

PS: You can see that the effective learning rate is really a composition of two hyperparameters - alpha and beta - and the two versions of the update are just two different ways of composing them :wink:
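
If it helps, here is a quick numerical check (a toy sketch with made-up values, not the course notebook): feed a constant gradient into both versions and compare where they settle.

```python
# Quick numerical check of the 1/(1-beta) scaling.
beta = 0.9
dW = 2.0                      # pretend the gradient is constant at 2.0

v_avg = 0.0                   # V_dw = beta*V_dw + (1-beta)*dW
v_sum = 0.0                   # V_dw = beta*V_dw + dW  (no 1-beta)

for t in range(200):
    v_avg = beta * v_avg + (1 - beta) * dW
    v_sum = beta * v_sum + dW

print(v_avg)                  # -> ~2.0   (matches dW)
print(v_sum)                  # -> ~20.0  (dW / (1 - beta), i.e. 10x larger)
print(v_sum * (1 - beta))     # -> ~2.0   (rescaling recovers the averaged version)
```

Because both V's start at zero, the simplified version is always exactly the averaged version divided by (1-B), so using alpha*(1-B) with the simplified update gives the same parameter steps as using alpha with the original one.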
