Course video: Week #2 (Link: https://www.coursera.org/learn/deep-neural-network/lecture/y0m1f/gradient-descent-with-momentum)
Hello, I'd like some clarification regarding the omission of (1 − β) when implementing gradient descent with momentum. Using v_dw as an example, the original equation is v_dw = β·v_dw + (1 − β)·dw. Dr. Ng says that when implementing it, we can omit the (1 − β) factor, so the equation becomes v_dw = β·v_dw + dw. The part I don't understand is his statement that this causes v_dw to end up scaled by a factor of 1/(1 − β). How does omitting (1 − β) lead to that scaling?
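To make sure I'm comparing the right things, here is a minimal NumPy sketch of the two update rules as I understand them. The names v_with, v_without, the value of beta, and the toy constant gradients are my own choices, not from the lecture:

```python
import numpy as np

beta = 0.9
grads = np.full(5, 1.0)  # toy example: a constant gradient dw = 1.0 at each step

# Version 1: the original update with the (1 - beta) factor
v_with = 0.0
for dw in grads:
    v_with = beta * v_with + (1 - beta) * dw

# Version 2: the simplified update that omits (1 - beta)
v_without = 0.0
for dw in grads:
    v_without = beta * v_without + dw

print(v_with, v_without, v_without / v_with)
```

When I run this, the ratio comes out to 10, i.e. 1/(1 − β) for β = 0.9, which matches what Dr. Ng says, but I don't see why that happens algebraically.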
I would greatly appreciate any clarifications regarding this. Thank you!