Gradient Descent with Momentum (formula)

Could you please help me understand this? In the videos it was said that we calculate Vdw as Vdw = beta*Vdw + (1-beta)*dW. Why do we use Vdw here and not Vdw-1? Also, if we initialize it with zeros, then the first term will be zero at every layer and it won't affect the value of Vdw.

Thank you for your help in advance!

This is a variation on the idea of an EWA (exponentially weighted average). Not exactly the same, but the same “flavor”. Notice that the previous value of Vdw (the one on the right-hand side) is zero only on the first iteration. They could have done the analog of the “bias correction” method for EWAs to change this, but decided not to.
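
For reference, here is a sketch of what that bias correction would look like if it were applied to Vdw. The iteration index t (counting from 1) and the "corrected" label are notation assumed here for illustration, not something taken from the course formula:

```latex
V_{dW,t}^{\text{corrected}} = \frac{V_{dW,t}}{1 - \beta^{t}}
```

With a zero initialization, the first iteration gives V_{dW,1} = (1 - beta) * dW_1, so the corrected value would be exactly dW_1. As t grows, beta^t shrinks toward 0 and the correction fades away.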

I think the formula as you stated it lacks some notational clarity. I would write it as

VdW_{t} = beta * VdW_{t-1} + (1 - beta) * dW_{t}

(Note the indices on VdW and dW.) In other words, even if VdW is initialised to 0, the first term vanishes only on the first iteration, where VdW_{1} = (1 - beta) * dW_{1}.
So VdW_{0} = 0, but in general VdW_{t-1} != 0 for all t >= 2 (unless the gradients themselves happen to be zero).
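
To see why the un-indexed formula still works in code, here is a minimal Python sketch of the recurrence. The function name `momentum_trace` and the use of scalar gradients are assumptions made for illustration; in a real network dW would be an array per layer. The point is that the assignment overwrites `v`, so the `v` on the right-hand side is always the previous value, VdW_{t-1}.

```python
def momentum_trace(grads, beta=0.9):
    """Return [VdW_1, VdW_2, ...] for a sequence of scalar gradients dW_1, dW_2, ..."""
    v = 0.0  # VdW_0: the zero initialisation
    trace = []
    for dw in grads:
        # The v on the right-hand side still holds the previous value, VdW_{t-1};
        # after this line it holds VdW_t.
        v = beta * v + (1 - beta) * dw
        trace.append(v)
    return trace


# Hypothetical gradients, all equal to 1.0, just to watch the average build up.
trace = momentum_trace([1.0, 1.0, 1.0], beta=0.9)
print(trace)
```

With these made-up gradients the values come out to roughly 0.1, 0.19 and 0.271: the beta term contributes nothing on the first step only, and from the second step on it carries the accumulated history.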