HELP - Something not clear with momentum gradient descent


Hello,
In the video explaining momentum, he shows the following formulas and I don't understand why they are correct.
You initialise v_dW to 0, and then compute v_dW = \beta * v_dW + (1 - \beta) * dW.
I don't understand why we multiply by the whole derivative dW. Aren't we supposed to use only the derivatives from the last ~10 iterations?

thank you.

We aren't using "the whole derivative of W": on each iteration the current dW is scaled by (1 - \beta) before it is folded into the running average. That was all explained in the lectures: depending primarily on the recent values is exactly what Exponentially Weighted Averages do for you, and that's the formula with \beta and (1 - \beta) as the factors. Prof Ng devotes several lectures to explaining how EWAs work and how to apply them for purposes like Momentum here.
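
To make that "recent values" point concrete, here is a small unrolling of the recursion, purely as my own illustration (not code from the course). Substituting the formula into itself shows that the gradient from k iterations ago ends up weighted by (1 - \beta) * \beta^k, so with \beta = 0.9 the last ~10 gradients supply most of the weight, which is the "average over roughly 1 / (1 - \beta) values" intuition from the EWA lectures:

```python
import numpy as np

beta = 0.9
num_iters = 30

# Weight on the gradient from k iterations ago, obtained by unrolling
# v_dW = beta * v_dW + (1 - beta) * dW back through the loop.
weights = np.array([(1 - beta) * beta**k for k in range(num_iters)])

print(weights[:3])         # [0.1, 0.09, 0.081] -> the most recent gradients dominate
print(weights[:10].sum())  # ~0.65: the last 10 gradients carry most of the total weight
print(weights.sum())       # ~0.96: approaches 1, so v_dW acts like a weighted average
```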

I don't think you understood me.
My question is why we're using dW and not dW[:, i:i+1], for example.

dW is not indexed by iteration, right? Its dimensions are neurons out by neurons in. The influence of recent iterations is handled by the EWA computation as you run the training loop; that is the purpose of \beta in the formula you show. I may still be missing your point, but I really think you should watch all the lectures about EWA again.
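
If it helps to see it in code, here is a minimal NumPy sketch using a toy quadratic loss of my own (not the course assignment): dW is always the full gradient of the current iteration, with the same shape as W, and the blending across recent iterations happens only through v_dW being carried from one pass of the loop to the next.

```python
import numpy as np

def momentum_update(W, v_dW, dW, beta=0.9, learning_rate=0.1):
    # dW is the full gradient from the current iteration only; it is never
    # indexed or sliced by iteration. The mixing of past iterations happens
    # through v_dW, which is carried from one step to the next.
    v_dW = beta * v_dW + (1 - beta) * dW
    W = W - learning_rate * v_dW
    return W, v_dW

# Toy problem: minimise ||W - W_target||^2, whose gradient 2 * (W - W_target)
# has the same shape as W (neurons out by neurons in).
W_target = np.ones((4, 3))
W = np.random.randn(4, 3)
v_dW = np.zeros_like(W)                  # initialised to zero, as in the lecture
for t in range(200):
    dW = 2 * (W - W_target)              # gradient for this iteration only
    W, v_dW = momentum_update(W, v_dW, dW)

print(np.abs(W - W_target).max())        # close to 0 after the loop
```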

I’m referring to the lecture Understanding Exponentially Weighted Averages in DLS C2 Week 2. That exactly addresses my interpretation of the question you’re asking.

I'll check it out again and update, thanks!