Momentum Gradient Descent question

hazingo · December 21, 2022, 12:07am

In the implementation stage of momentum gradient descent

VdW = B * VdW + (1 - B) * dW

does the B * VdW term on the right side of the equation refer to the previous VdW? In the last couple of videos we learned about implementing exponentially weighted averages, where the equation for v refers to the previous v. Is it the same here?

In my head, I was wondering what the point of B * VdW is: B is a constant, so the term does not seem to rely on previous values, and you might as well just change the learning rate.

Thank you so much,

rmwkwok · December 21, 2022, 1:21am

Hello @hazingo,

If I just look at your equation: anything of the form x = b * x + (1 - b) * y is an exponential moving average (EMA) that updates x with y. The EMA is like a weighted sum, so if b = 0.85, then 85% of the next x comes from the current x and the remaining 1 - b = 15% comes from the new value y.

The higher b is, the more slowly x moves toward the new value y, because a higher b preserves more of the current x. This ability to “preserve” the current value is what makes it momentum-like.
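To make this concrete, here is a minimal Python sketch of that update. The function names `ema_update` and `run_ema` and the sample inputs are made up for illustration and are not from the course material.

```python
def ema_update(x, y, b):
    """Return b * (current x) + (1 - b) * (new value y)."""
    return b * x + (1 - b) * y


def run_ema(values, b, x0=0.0):
    """Apply the EMA update once per incoming value and record each result."""
    x = x0
    history = []
    for y in values:
        # The x passed in here still holds the result of the previous
        # iteration; it is overwritten only after the right side is evaluated.
        x = ema_update(x, y, b)
        history.append(x)
    return history


# Hypothetical input: the new value y is always 1.0, starting from x = 0.0.
slow = run_ema([1.0] * 5, b=0.85)  # keeps 85% of the current x each step
fast = run_ema([1.0] * 5, b=0.5)   # keeps 50% of the current x each step
print(slow)
print(fast)
```

With the higher b, every entry of `slow` is further from 1.0 than the matching entry of `fast`, which is the slower response described above.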

If you still have questions, or you think my answer isn’t what you are looking for, let me know. It would also be great if you could share where you saw the equation: which video or which assignment?

Cheers,
Raymond

hazingo · December 22, 2022, 7:04am

I am referring to the “Momentum gradient scene” on week 2; I’m confused about the implementation.

rmwkwok · December 22, 2022, 11:04am

Can you share the video name and the timestamp that shows what you are confused about? I want to look at what you looked at.

hazingo · December 22, 2022, 11:26am

Okay, if you look at around the 3-minute mark of the video: I am confused about why it is VdW = B * VdW + (1 - B) * dW and not V(dw-1), given the previous couple of videos.

Thanks

rmwkwok · December 23, 2022, 1:55am

So I think this is the equation you are talking about:

[Image: Screenshot_20221223_095257]

I believe I have explained why the equation looks the way it is in my first reply.
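For reference, the same update can be written with an explicit step index. The superscript $(t)$ and the starting value $V_{dW}^{(0)} = 0$ are notation assumed here for illustration, not taken from the slide:

$$
V_{dW}^{(t)} = \beta \, V_{dW}^{(t-1)} + (1 - \beta) \, dW^{(t)}
$$

Unrolling this from $V_{dW}^{(0)} = 0$ gives

$$
V_{dW}^{(t)} = (1 - \beta) \sum_{k=0}^{t-1} \beta^{k} \, dW^{(t-k)}
$$

so the value depends on all earlier gradients, with older ones weighted less. In an implementation the index is usually dropped because the variable is overwritten in place: when the right side is evaluated, VdW still holds the value from the previous iteration.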

Can you explain your logic for $V_{dw} = V (dw - 1)$? I need to know which information from which previous videos you are connecting, and how, to see your idea. Perhaps there is something we all agree on, and maybe there is something we want to discuss further.

This ( $V_{dw} = V (dw - 1)$ ) is how I write that equation in LaTeX; perhaps you would like to use LaTeX to present your idea?

Please present your idea in full.

Raymond
