Question Regarding Scaling of V_(dw) and V_(db)

Course video: Week #2 (Link: https://www.coursera.org/learn/deep-neural-network/lecture/y0m1f/gradient-descent-with-momentum)

Hello, I would like some clarification regarding the omission of (1-B) when implementing gradient descent with momentum. Using V_(dw) as an example, the original equation is V_(dw) = B*V_(dw) + (1-B)*dw. Dr. Ng says that when implementing it, we can omit the (1-B), so the equation becomes V_(dw) = B*V_(dw) + dw. What I don't understand is his statement that this leads to V_(dw) being scaled by a factor of 1/(1-B). How does omitting (1-B) lead to such scaling?
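
For concreteness, the two updates I am comparing look roughly like this (a toy NumPy sketch with made-up variable names, not the code from the programming assignment):

```python
import numpy as np

beta = 0.9
dW = np.random.randn(3, 2)        # stand-in for the gradient of W
v_dW_avg = np.zeros_like(dW)      # for the version with (1 - beta)
v_dW_noavg = np.zeros_like(dW)    # for the version without it

# Original equation from the lecture: exponentially weighted average
v_dW_avg = beta * v_dW_avg + (1 - beta) * dW

# Simplified version with (1 - beta) omitted
v_dW_noavg = beta * v_dW_noavg + dW
```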

I would greatly appreciate any clarifications regarding this. Thank you!

Hello @Tommy_Lee,

With the (1-B) term, V_(dw) = B*V_(dw) + (1-B)*dw is an exponentially weighted average of the gradients. Unrolling the recursion gives V_(dw) = (1-B)*(dw_t + B*dw_(t-1) + B^2*dw_(t-2) + ...), and the weights (1-B)*(1 + B + B^2 + ...) sum to 1, so if the gradients were roughly constant, V_(dw) would settle near dw.

If you drop the (1-B), the unrolled sum becomes V_(dw) = dw_t + B*dw_(t-1) + B^2*dw_(t-2) + ..., and the weights now sum to 1 + B + B^2 + ... = 1/(1-B). For the same gradients, V_(dw) therefore ends up about 1/(1-B) times larger than before. Nothing is wrong with that version, but because the parameter update is W := W - alpha*V_(dw), the steps also become 1/(1-B) times larger, so the best value of alpha shifts: you would scale alpha down by (1-B), and you would have to retune it whenever you change B.

Cheers,
Raymond

PS: You can see that the effective learning rate is really a composition of two hyperparameters - alpha and beta - and the two versions of the update are just two different ways of composing them :wink:
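
If it helps, here is a quick numerical check (a toy sketch with made-up values, not the course notebook): feed a constant gradient into both versions and compare where they settle.

```python
# Quick numerical check of the 1/(1-beta) scaling.
beta = 0.9
dW = 2.0                      # pretend the gradient is constant at 2.0

v_avg = 0.0                   # V_dw = beta*V_dw + (1-beta)*dW
v_sum = 0.0                   # V_dw = beta*V_dw + dW  (no 1-beta)

for t in range(200):
    v_avg = beta * v_avg + (1 - beta) * dW
    v_sum = beta * v_sum + dW

print(v_avg)                  # -> ~2.0   (matches dW)
print(v_sum)                  # -> ~20.0  (dW / (1 - beta), i.e. 10x larger)
print(v_sum * (1 - beta))     # -> ~2.0   (rescaling recovers the averaged version)
```

Because both V's start at zero, the simplified version is always exactly the averaged version divided by (1-B), so using alpha*(1-B) with the simplified update gives the same parameter steps as using alpha with the original one.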
