Above is the formula for gradient descent with momentum. However, it differs from the exponentially weighted moving average formula, which confuses me. In the formula in the image above, beta multiplies V_{dw}, which is the value of the gradient of the current layer, not the previous layer.

I would greatly appreciate any help!

Hi @yeoh_zhewei

The exponentially weighted moving average is used to decrease oscillation. For example, with the formula X(new)=\beta X(old) + (1-\beta)X(calculated) and \beta = 0.8, the new X consists of 80% of the old X and only 20% of the newly calculated X.

In other words, the new X is strongly tied to the old X (by 80%). This is used in time series to smooth the line of a graph and decrease its oscillation.
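To make this concrete, here is a minimal sketch of that smoothing formula applied to a noisy series. The data, \beta = 0.8, and initialization at zero are all illustrative assumptions, not from the course:

```python
import numpy as np

# Hypothetical noisy series: a sine wave plus Gaussian noise (illustrative data).
rng = np.random.default_rng(0)
raw = np.sin(np.linspace(0, 4 * np.pi, 200)) + rng.normal(0, 0.3, 200)

beta = 0.8                       # weight on the old (smoothed) value
smoothed = np.zeros_like(raw)
x = 0.0                          # X(old), initialized to zero
for t, x_calc in enumerate(raw):
    # X(new) = beta * X(old) + (1 - beta) * X(calculated)
    x = beta * x + (1 - beta) * x_calc
    smoothed[t] = x

# The smoothed series varies much less from step to step than the raw one.
print(np.abs(np.diff(raw)).mean(), np.abs(np.diff(smoothed)).mean())
```

Plotting `smoothed` against `raw` shows the same overall line with the oscillation largely averaged out.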


In the image, V_{dw}(new)=\beta V_{dw}(old) + (1-\beta)dw(calculated). So V_{dw} is the smoothed value of the gradient for the current iteration: it depends on the previous value V_{dw}(old) with weight \beta, and on the newly calculated gradient dw(calculated) with weight (1-\beta).

You can estimate how many past values the new V_{dw} effectively averages over with \frac{1}{1-\beta}. For example, if \beta = 0.8, then V_{dw}(new) is effectively an average over the last five calculated gradients. This is what decreases the oscillation when we compute the gradient descent updates.
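Putting the two pieces together, this is a sketch of gradient descent with momentum on a toy quadratic loss. The loss function, learning rate, \beta, and initial values are assumptions for illustration only:

```python
def grad(w):
    # Gradient of a simple quadratic loss f(w) = 0.5 * w**2 (illustrative).
    return w

beta = 0.8    # momentum coefficient (assumed value)
alpha = 0.1   # learning rate (assumed value)
w = 5.0       # initial weight
v_dw = 0.0    # V_dw(old), initialized to zero

for _ in range(100):
    dw = grad(w)                          # dw(calculated) for this iteration
    v_dw = beta * v_dw + (1 - beta) * dw  # V_dw(new) = beta*V_dw(old) + (1-beta)*dw
    w -= alpha * v_dw                     # update uses the smoothed gradient

print(w)  # close to the minimum at w = 0
```

Note that `v_dw` on the right-hand side is the value from the previous iteration, so each update blends the old smoothed gradient with the freshly computed one, exactly as in the exponentially weighted average above.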

Regards,

Abdelrahman