In the implementation stage of momentum gradient descent

VdW = B * VdW + (1 - B) * dW

is the VdW on the right side of the equation referring to the previous VdW? In the last couple of videos we learned about implementing exponentially weighted averages, where the equation for V refers to the previous V. Is it the same here?

In my head, I was wondering what the point of B * VdW is, because B is a constant and it does not seem to rely on previous values; you might as well just change the learning rate.

If I just look at your equation, anything of the form x = b * x + (1 - b) * y is an exponential moving average (EMA) that updates x with y. The EMA is a weighted sum, so if b = 0.85, then 85% of the next x comes from the current x and the remaining 1 - b = 15% comes from the new value y.

The higher b is, the more slowly this weighted sum lets x move toward the new value y, because a higher b preserves more of the current x. That ability to "preserve" is what makes it momentum-like.
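To make the effect of b concrete, here is a minimal sketch of an EMA over a toy stream of values (the function name and numbers are illustrative, not from the course):

```python
def ema(values, b=0.85):
    """Exponential moving average: x = b * x + (1 - b) * y for each new y."""
    x = values[0]                      # initialize with the first observation
    for y in values[1:]:
        x = b * x + (1 - b) * y        # e.g. 85% old x, 15% new y
    return x

# A higher b preserves more of the current x, so x approaches
# new values more slowly.
print(ema([0.0, 1.0, 1.0, 1.0], b=0.85))  # still far from 1
print(ema([0.0, 1.0, 1.0, 1.0], b=0.10))  # almost at 1 already
```

With b = 0.85 the average only creeps toward 1 after three new observations, while with b = 0.10 it is nearly there; that slow response to new values is the "momentum" behavior.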

If you still have questions, or you think my answer isn't what you are looking for, let me know. It would also be great if you could share a reference to where you saw the equation: which video or which assignment?

Okay, looking at around the 3-minute mark of the video: I am confused why it is VdW = VdW * B + (1 - B) * dW and not V(dW - 1), if we are accounting for the previous couple of videos.

So I think this is the equation you are talking about:

I believe I have explained why the equation looks the way it is in my first reply.

Can you explain your logic for V_{dw} = V(dw - 1)? I'd like to know how you are connecting which information, from which previous videos, and in what way, so we can see your idea; perhaps there is something we all agree on, and maybe something we want to discuss further.

This ( $V_{dw} = V(dw - 1)$ ) is how I would write your equation ( V_{dw} = V(dw - 1) ) in LaTeX; perhaps you want to use LaTeX to present your idea?