Could you help me understand this, please? In the videos it was said that we calculate VdW as VdW = beta * VdW + (1 - beta) * dW. Why do we use VdW here and not VdW at step t-1? Also, if we initialize it with zeros, then the first term will be zero at every layer and it won't affect the value of VdW.

This is a variation on the idea of an EWA (exponentially weighted average). Not exactly the same, but the same "flavor". Notice that VdW is only zero on the first iteration. They could have applied the analog of the "bias correction" method for EWAs to change this, but decided not to.
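To illustrate the bias correction mentioned above: dividing by (1 - beta^t) compensates for the zero initialization in the early iterations. This is a minimal sketch on a constant signal, not the course's actual code; the variable names are my own.

```python
beta = 0.9
data = [1.0, 1.0, 1.0]  # a constant signal, so the true average is exactly 1.0

v = 0.0  # initialized to zero, like VdW
for t, x in enumerate(data, start=1):
    v = beta * v + (1 - beta) * x
    # bias correction: rescale to undo the pull toward the zero initialization
    v_corrected = v / (1 - beta ** t)
    print(f"t={t}: raw={v:.4f}, corrected={v_corrected:.4f}")
```

On a constant signal of 1.0, the raw average starts at 0.1 and only slowly approaches 1.0, while the corrected value is exactly 1.0 at every step.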

I think the formula as you stated it lacks some notational clarity; I would write it as

VdW_{t} = beta * VdW_{t-1} + (1 - beta) * dW_{t}

(Note the indices on VdW and dW). In other words, even if VdW is initialised to 0, its contribution will only vanish on the first iteration, and VdW_{1} = (1 - beta) * dW_{1}.
VdW_{t-1} != 0 for all t >= 2 (only VdW_{0} is zero).