I think there is a problem with this exercise
here is the code which is the correct answer (I think it’s not correct)
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate = 0.01,
beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8):
"""
Update parameters using Adam
Arguments:
parameters -- python dictionary containing your parameters:
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
grads -- python dictionary containing your gradients for each parameters:
grads['dW' + str(l)] = dWl
grads['db' + str(l)] = dbl
v -- Adam variable, moving average of the first gradient, python dictionary
s -- Adam variable, moving average of the squared gradient, python dictionary
t -- Adam variable, counts the number of taken steps
learning_rate -- the learning rate, scalar.
beta1 -- Exponential decay hyperparameter for the first moment estimates
beta2 -- Exponential decay hyperparameter for the second moment estimates
epsilon -- hyperparameter preventing division by zero in Adam updates
Returns:
parameters -- python dictionary containing your updated parameters
v -- Adam variable, moving average of the first gradient, python dictionary
s -- Adam variable, moving average of the squared gradient, python dictionary
"""
L = len(parameters) // 2 # number of layers in the neural networks
v_corrected = {} # Initializing first moment estimate, python dictionary
s_corrected = {} # Initializing second moment estimate, python dictionary
# Perform Adam update on all parameters
for l in range(1, L + 1):
# Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
# (approx. 2 lines)
# v["dW" + str(l)] = ...
# v["db" + str(l)] = ...
# YOUR CODE STARTS HERE
v["dW" + str(l)] = beta1 * v["dW" + str(l)] + (1 - beta1) * grads["dW" + str(l)]
v["db" + str(l)] = beta1 * v["db" + str(l)] + (1 - beta1) * grads["db" + str(l)]
# YOUR CODE ENDS HERE
# Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
# (approx. 2 lines)
# v_corrected["dW" + str(l)] = ...
# v_corrected["db" + str(l)] = ...
# YOUR CODE STARTS HERE
v_corrected["dW" + str(l)] = v["dW" + str(l)] / (1 - np.power(beta1, t))
v_corrected["db" + str(l)] = v["db" + str(l)] / (1 - np.power(beta1, t))
# YOUR CODE ENDS HERE
# Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
#(approx. 2 lines)
# s["dW" + str(l)] = ...
# s["db" + str(l)] = ...
# YOUR CODE STARTS HERE
s["dW" + str(l)] = beta2 * s["dW" + str(l)] + (1 - beta2) * np.power(grads["dW" + str(l)], 2)
s["db" + str(l)] = beta2 * s["db" + str(l)] + (1 - beta2) * np.power(grads["db" + str(l)], 2)
# YOUR CODE ENDS HERE
# Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
# (approx. 2 lines)
# s_corrected["dW" + str(l)] = ...
# s_corrected["db" + str(l)] = ...
# YOUR CODE STARTS HERE
s_corrected["dW" + str(l)] = s["dW" + str(l)] / (1 - np.power(beta2, t))
s_corrected["db" + str(l)] = s["db" + str(l)] / (1 - np.power(beta2, t))
# YOUR CODE ENDS HERE
# Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
# (approx. 2 lines)
# parameters["W" + str(l)] = ...
# parameters["b" + str(l)] = ...
# YOUR CODE STARTS HERE
parameters["W" + str(l)] = parameters["W" + str(l)] - learning_rate * (v_corrected["dW" + str(l)] / (np.sqrt(s_corrected["dW" + str(l)]) + epsilon))
parameters["b" + str(l)] = parameters["b" + str(l)] - learning_rate * (v_corrected["db" + str(l)] / (np.sqrt(s_corrected["db" + str(l)]) + epsilon))
# YOUR CODE ENDS HERE
return parameters, v, s, v_corrected, s_corrected
notice that we’re calculating v_corrected by v_corrected["dW" + str(l)] = v["dW" + str(l)] / (1 - np.power(beta1, t))
which “t” is the iteration number and it should be from 1 to num_iterations , however we’re using a fix number, so instead of using “t” in our code for the v_corrected and s_corrected, beta1 and beta2 should have the power of “l” which is the iteration number.
is that correct?