Here is the formula for gradient descent with momentum from this week's assignment, which passed the tests:
for l in range(1, L + 1):
    v["dW" + str(l)] = beta * v["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]
    v["db" + str(l)] = beta * v["db" + str(l)] + (1 - beta) * grads["db" + str(l)]
    parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * v["dW" + str(l)]
    parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * v["db" + str(l)]
And vdW1, vdW2, … are all set to zero initially.
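The zero initialization I am referring to looks roughly like this (my own sketch from memory, not the exact assignment code):

```python
import numpy as np

def initialize_velocity(parameters):
    # One zero array per dW/db, matching the shape of the corresponding W/b
    L = len(parameters) // 2              # number of layers
    v = {}
    for l in range(1, L + 1):
        v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
    return v
```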
What I find by iterating over l is that v does not take the previous v into account, as opposed to what Andrew taught in the lectures.
vdW1 will simply be updated with a zero vector plus dW1 * alpha,
and vdW2 will likewise be updated with a zero vector plus dW2 * alpha;
there is no link between vdW1 and vdW2.
I really think the formula should be changed so that vdW2 becomes vdW1 - dW1 * alpha, rather than the vdW2 - dW1 * alpha shown in Andrew's lecture and also in the assignment.
Please correct me if I have deduced this wrongly.
"And vdW1, vdW2, … are all set to zero initially."
If you take a look at the model() function, you will see that the velocity v is initialized to zeros only once, at the start of the optimization loop, and is then updated per minibatch by calling the update_parameters_with_momentum() function. That function adjusts the parameters and v at every layer, and it is called once per minibatch, each time receiving the v left over from the previous call, until all minibatches are done.
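To make that concrete, here is a toy, self-contained sketch, using a one-layer stand-in for the graded function and made-up gradients (so not the actual assignment code), which shows that each call receives the v produced by the previous call:

```python
import numpy as np

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    # Same rule as in your code above, written out for a single layer l = 1
    v["dW1"] = beta * v["dW1"] + (1 - beta) * grads["dW1"]
    parameters["W1"] = parameters["W1"] - learning_rate * v["dW1"]
    return parameters, v

parameters = {"W1": np.array([1.0])}
v = {"dW1": np.zeros(1)}                      # zeroed ONCE, before the loop

# Pretend the same gradient comes back from three successive minibatches
minibatch_grads = [{"dW1": np.array([1.0])}] * 3

for grads in minibatch_grads:
    parameters, v = update_parameters_with_momentum(
        parameters, grads, v, beta=0.9, learning_rate=0.01)
    print(v["dW1"])   # ~0.1, then ~0.19, then ~0.271: each call uses the previous v
```

If v really were reset to zero before every update, all three prints would show 0.1; the growing values show that beta * v is picking up the velocity left by the previous minibatch.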
"I really think the formula should be changed so that vdW2 becomes vdW1 - dW1 * alpha"
Why do you think the formula should be changed?
"vdW2 - dW1 * alpha"
The dW[1] there should be dW of layer l (lowercase L), not of layer 1.
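In other words, writing t for the minibatch update step and [l] for the layer, the recurrence from the lecture is (my notation, keeping the layer index and the step index separate):

$$
v_{dW^{[l]}}^{\langle t \rangle} = \beta \, v_{dW^{[l]}}^{\langle t-1 \rangle} + (1-\beta)\, dW^{[l]\langle t \rangle},
\qquad
W^{[l]} := W^{[l]} - \alpha \, v_{dW^{[l]}}^{\langle t \rangle},
\qquad
v_{dW^{[l]}}^{\langle 0 \rangle} = 0.
$$

The "previous v" is the velocity of the same layer l from the previous minibatch step t-1, not the velocity of layer l-1.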