Gradient descent with Momentum - Week 2

#Week2
#Gradient descent with Momentum

Hi,
I have a question regarding gradient descent with momentum, please.

The video says that some research papers write the equation on the right, v_dw = B*v_dw + dW, instead of the equation on the left. As I understand it, we get the equation on the right by multiplying the equation on the left by 1/(1-B).
I do not understand why we don’t get 1/(1-B)*v_dw = 1/(1-B)*v_dw + dW instead, please?

Secondly, our v_dw is being scaled by 1/(1-B). So, when we perform the gradient descent update, we need to change alpha by 1/(1-B).
I do not understand this: if we use the equation above, v_dw = B*v_dw + dW, and also change alpha by 1/(1-B), then the last term, dW, has already been scaled by 1/(1-B) (as in the first image), and we scale it by 1/(1-B) a second time through alpha…

Can someone explain the intuition to me, please?
I hope I was clear in my explanation :slight_smile:
Thank you,
Kind regards,
Sao Mai

Is this thread related to your question?


Hi,
Thank you very much for your reply :slight_smile: !
In fact, not really.
My question was: the formula we learnt in the videos for gradient descent with momentum is v_dw = Beta*v_dw + (1-Beta)*dW.
However, in some literature we often see v_dw = Beta*v_dw + dW, with the (1-Beta) omitted. I wanted to know, please, how we end up with this formula, v_dw = Beta*v_dw + dW?
Secondly, when using this formula, v_dw = Beta*v_dw + dW, how do we need to update our parameter W, please? (Instead of W = W - a*v_dw, what should we use?)

I would like to understand the mechanism behind this, please.
Thank you very much
Sao Mai

Sorry, I do not know.


No worries! Thank you so much for your reply :smiling_face:

We are not deriving the formula on the right-hand side (v_{dw} = \beta v_{dw} + dW) from the formula on the left-hand side (v_{dw} = \beta v_{dw} + (1 - \beta) dW). They are two different formulations of momentum, both used to control the vertical movement, and as Prof. Andrew said, “both of these will work just fine”, though he also mentioned some limitations of the right-hand side formula.

The update is the same as for the other methods: W = W - \alpha v_{dw} and b = b - \alpha v_{db}
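
To make the role of \alpha a bit more concrete, here is a minimal sketch in plain Python. It is not from the course: the toy loss 0.5 * w^2, the function names, and the values of alpha and beta are all assumptions chosen for illustration. Both versions use the same update W = W - \alpha v_{dw}; the only difference is whether (1 - \beta) multiplies dW. Starting from v = 0, the version without (1 - \beta) builds a velocity that is larger by a factor of 1 / (1 - \beta), so in this sketch its learning rate is multiplied by (1 - \beta) to compensate.

```python
def momentum_with_factor(grad, w, alpha, beta, steps):
    """v = beta * v + (1 - beta) * dW, then W = W - alpha * v."""
    v = 0.0
    path = []
    for _ in range(steps):
        v = beta * v + (1 - beta) * grad(w)
        w = w - alpha * v
        path.append(w)
    return path


def momentum_without_factor(grad, w, alpha, beta, steps):
    """v = beta * v + dW, then W = W - alpha * v (same update rule)."""
    v = 0.0
    path = []
    for _ in range(steps):
        v = beta * v + grad(w)
        w = w - alpha * v
        path.append(w)
    return path


def grad(w):
    # Gradient of the toy loss 0.5 * w**2 (an assumption for this example).
    return w


beta = 0.9   # illustrative value
alpha = 0.1  # illustrative value

path_a = momentum_with_factor(grad, 5.0, alpha, beta, 50)
# Same update rule, but alpha is rescaled by (1 - beta) to compensate
# for the larger velocity of the version without the (1 - beta) factor.
path_b = momentum_without_factor(grad, 5.0, alpha * (1 - beta), beta, 50)

# Largest difference between the two parameter paths.
max_gap = max(abs(a - b) for a, b in zip(path_a, path_b))
print(max_gap)
```

On this toy problem the two paths should agree up to floating-point rounding. If alpha is left unchanged instead, the version without (1 - \beta) takes steps that are 1 / (1 - \beta) times larger, which is why \alpha and \beta have to be tuned together in that formulation.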


I see: we are not deriving the formula on the right-hand side from the formula on the left-hand side.
Thank you so much for your explanation, everything is clear :slight_smile: