# Week 2 - Assignment 1 - Ex6

I think there is a problem with this exercise.
Here is the code that passes as the correct answer (though I think it's not correct):

```python
import numpy as np


def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate=0.01,
                                beta1=0.9, beta2=0.999, epsilon=1e-8):
    """
    Update parameters using Adam.

    Arguments:
    parameters -- python dictionary containing your parameters:
                    parameters['W' + str(l)] = Wl
                    parameters['b' + str(l)] = bl
    grads -- python dictionary containing your gradients:
                    grads['dW' + str(l)] = dWl
                    grads['db' + str(l)] = dbl
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    t -- Adam variable, counts the number of taken steps
    learning_rate -- the learning rate, scalar
    beta1 -- Exponential decay hyperparameter for the first moment estimates
    beta2 -- Exponential decay hyperparameter for the second moment estimates
    epsilon -- hyperparameter preventing division by zero in the Adam update

    Returns:
    parameters -- python dictionary containing your updated parameters
    v -- Adam variable, moving average of the first gradient, python dictionary
    s -- Adam variable, moving average of the squared gradient, python dictionary
    """

    L = len(parameters) // 2    # number of layers in the neural network
    v_corrected = {}            # bias-corrected first moment estimate, python dictionary
    s_corrected = {}            # bias-corrected second moment estimate, python dictionary

    # Perform the Adam update on all parameters
    for l in range(1, L + 1):
        # Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
        v["dW" + str(l)] = beta1 * v["dW" + str(l)] + (1 - beta1) * grads["dW" + str(l)]
        v["db" + str(l)] = beta1 * v["db" + str(l)] + (1 - beta1) * grads["db" + str(l)]

        # Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
        v_corrected["dW" + str(l)] = v["dW" + str(l)] / (1 - np.power(beta1, t))
        v_corrected["db" + str(l)] = v["db" + str(l)] / (1 - np.power(beta1, t))

        # Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
        s["dW" + str(l)] = beta2 * s["dW" + str(l)] + (1 - beta2) * np.power(grads["dW" + str(l)], 2)
        s["db" + str(l)] = beta2 * s["db" + str(l)] + (1 - beta2) * np.power(grads["db" + str(l)], 2)

        # Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
        s_corrected["dW" + str(l)] = s["dW" + str(l)] / (1 - np.power(beta2, t))
        s_corrected["db" + str(l)] = s["db" + str(l)] / (1 - np.power(beta2, t))

        # Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
        parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * (v_corrected["dW" + str(l)] / (np.sqrt(s_corrected["dW" + str(l)]) + epsilon))
        parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * (v_corrected["db" + str(l)] / (np.sqrt(s_corrected["db" + str(l)]) + epsilon))

    return parameters, v, s, v_corrected, s_corrected
```
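As a side note, the bias-correction step can be checked numerically in isolation. Here is a minimal standalone sketch (not part of the notebook's test code): at step `t = 1` the moving average `v` is heavily biased toward its zero initialization, and dividing by `(1 - beta1**t)` undoes exactly that bias.

```python
import numpy as np

beta1 = 0.9
grad = np.array([1.0, 2.0])

# First moment starts at zero, so the first moving average is only 0.1 * grad.
v = np.zeros_like(grad)
v = beta1 * v + (1 - beta1) * grad          # v after the first step

t = 1
v_corrected = v / (1 - np.power(beta1, t))  # denominator is also 0.1
print(v_corrected)                          # recovers grad (up to floating point)
```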

Notice that we're calculating `v_corrected` with `v_corrected["dW" + str(l)] = v["dW" + str(l)] / (1 - np.power(beta1, t))`,
where `t` is the iteration number and should run from 1 to `num_iterations`. However, we're using a fixed number, so instead of using `t` in our code for `v_corrected` and `s_corrected`, shouldn't `beta1` and `beta2` be raised to the power of `l`, which is the iteration number?
Is that correct?


I think your code looks correct. Your last statement is incorrect: `t` is an argument to the function, and `l` is the number of the layer of the network. The "for" loop there is over all the layers of the network: you need to update all the weights on each iteration, right?

Are you saying that this code fails the tests in the notebook or fails the grader?

I mean that here

```python
v_corrected["dW" + str(l)] = v["dW" + str(l)] / (1 - np.power(beta1, t))
```

instead of using `t`, which is a constant number over the iterations, we should use `l`, like the code below:

```python
v_corrected["dW" + str(l)] = v["dW" + str(l)] / (1 - np.power(beta1, l))
```

That way we are changing the power of beta over the iterations, as Andrew said before.

Yes, I understood what you are saying, and my response was that it is incorrect. The point is that the "for" loop in the update-parameters routine is over the layers of the network, not over the iterations of gradient descent. The update routine is called once per iteration by the higher-level logic, which is also what increments `t`. Look at the logic in the model function later in the notebook, which is where the update routines are called, and at the formulas as shown in the notebook.
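To make the distinction concrete, here is a hypothetical minimal loop (`num_iterations` and `num_layers` are placeholder values, not the notebook's actual code). `t` advances once per training iteration, while `l` only indexes layers within a single update call, so the bias-correction factor `(1 - beta1**t)` changes between iterations but is identical for every layer inside the same iteration:

```python
beta1 = 0.9
num_iterations = 3   # placeholder value
num_layers = 2       # placeholder value

t = 0
factors = []
for i in range(num_iterations):
    t = t + 1                              # one Adam step per iteration
    for l in range(1, num_layers + 1):     # all layers share the same t
        factors.append((t, l, 1 - beta1 ** t))

for entry in factors:
    print(entry)
```

If `l` were used in the denominator instead, the correction factor would stay frozen per layer forever, which defeats the purpose of the bias correction.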