Here is the formula for gradient descent with momentum from this week's assignment, which passed the tests:
for l in range(1, L + 1):
    v["dW" + str(l)] = beta * v["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]
    v["db" + str(l)] = beta * v["db" + str(l)] + (1 - beta) * grads["db" + str(l)]
    parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * v["dW" + str(l)]
    parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * v["db" + str(l)]
And vdW1, vdW2, … are all set to zero initially.
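The zero initialization I am referring to looks roughly like this (my own sketch from memory, not the exact assignment code):

```python
import numpy as np

def initialize_velocity(parameters):
    # One zero array per dW/db, matching the shape of the corresponding W/b
    L = len(parameters) // 2              # number of layers
    v = {}
    for l in range(1, L + 1):
        v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
        v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
    return v
```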
What I find by iterating over l is that v does not take the previous v into account, as opposed to what Andrew taught in the lectures.
vdW1 will simply be updated with a zero vector plus dW1 * alpha,
and vdW2 will likewise be updated with a zero vector plus dW2 * alpha;
there is no link between vdW1 and vdW2.
I really think the formula should be changed so that vdW2 becomes vdW1 - dW1 * alpha, rather than the vdW2 - dW1 * alpha shown in Andrew's lecture and also in the assignment.
Please correct me if I have deduced this wrongly.
"And vdW1, vdW2, … are all set to zero initially."
If you take a look at the model() function, you will see that the velocity v is initialized to zeros only once, at the start of the optimization loop, and is then updated per minibatch by calling the update_parameters_with_momentum() function. That function adjusts the parameters and v at every layer, and it is called once per minibatch, each time receiving the v left over from the previous call, until all minibatches are done.
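To make that concrete, here is a toy, self-contained sketch, using a one-layer stand-in for the graded function and made-up gradients (so not the actual assignment code), which shows that each call receives the v produced by the previous call:

```python
import numpy as np

def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
    # Same rule as in your code above, written out for a single layer l = 1
    v["dW1"] = beta * v["dW1"] + (1 - beta) * grads["dW1"]
    parameters["W1"] = parameters["W1"] - learning_rate * v["dW1"]
    return parameters, v

parameters = {"W1": np.array([1.0])}
v = {"dW1": np.zeros(1)}                      # zeroed ONCE, before the loop

# Pretend the same gradient comes back from three successive minibatches
minibatch_grads = [{"dW1": np.array([1.0])}] * 3

for grads in minibatch_grads:
    parameters, v = update_parameters_with_momentum(
        parameters, grads, v, beta=0.9, learning_rate=0.01)
    print(v["dW1"])   # ~0.1, then ~0.19, then ~0.271: each call uses the previous v
```

If v really were reset to zero before every update, all three prints would show 0.1; the growing values show that beta * v is picking up the velocity left by the previous minibatch.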
"I really think the formula should be changed so that vdW2 becomes vdW1 - dW1 * alpha"
Why do you think the formula should be changed?
"vdW2 - dW1 * alpha"
The dW[1] there should be dW of layer l (lowercase L), not of layer 1.
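In other words, writing t for the minibatch update step and [l] for the layer, the recurrence from the lecture is (my notation, keeping the layer index and the step index separate):

$$
v_{dW^{[l]}}^{\langle t \rangle} = \beta \, v_{dW^{[l]}}^{\langle t-1 \rangle} + (1-\beta)\, dW^{[l]\langle t \rangle},
\qquad
W^{[l]} := W^{[l]} - \alpha \, v_{dW^{[l]}}^{\langle t \rangle},
\qquad
v_{dW^{[l]}}^{\langle 0 \rangle} = 0.
$$

The "previous v" is the velocity of the same layer l from the previous minibatch step t-1, not the velocity of layer l-1.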