Hi, I need to confirm a notion because it seems a little contradictory. It is about the oscillation of the gradient descent algorithm when searching for the minimum.

From my optimization courses, I understand that successive gradient steps can be nearly orthogonal to one another, which causes changes in direction and makes gradient descent's progress toward the optimum less efficient. But in the programming exercise introduction, the argument was (quoting the programming assignment text): "In Stochastic Gradient Descent, you use only 1 training example before updating the gradients. When the training set is large, SGD can be faster. But the parameters will 'oscillate' toward the minimum rather than converge smoothly", followed by diagrams where SGD oscillates heavily and batch GD does not.

So my question is: is the oscillation a result of taking steps toward the minimum with small batches of training examples, or does using SGD (small batches) simply make the oscillation worse / noisier? If the answer is the former, then there would be no need to use Momentum or Adam optimizers with batch GD. But the assignment text also says: "Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent." This statement matches what I expected: these optimizers take past gradients into account, so even with batch GD we still want to use past gradients to get a smoother descent, i.e. GD by itself has oscillations, and using stochastic GD just makes the oscillations worse.
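For reference, here is a minimal sketch of the momentum update as I understand it (the function name, variable names and default values are my own placeholders, not taken from the assignment):

import numpy as np

def momentum_step(W, dW, v_dW, learning_rate=0.01, beta=0.9):
    # Exponentially weighted average of past gradients: this is the
    # "taking past gradients into account" part, and it applies whether
    # dW came from a full batch, a mini-batch, or a single example.
    v_dW = beta * v_dW + (1 - beta) * dW
    # Update the parameter using the smoothed gradient
    W = W - learning_rate * v_dW
    return W, v_dW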
Question 2: Programming Assignment, Exercises 3 and 5: we are asked to write the code for initializing the Momentum and Adam parameters. I found a bug in my code, but I was hoping you could explain why one version worked and the other did not: both passed the test cases, but when I used the failing one in the model to train my NN, I kept getting a KeyError…
Code 1 (taking the Momentum parameter initialization as a case study):
v["dW" + str(l+1)] = np.zeros((np.shape(grads["W" + str(l+1)])))
v["db" + str(l+1)] = np.zeros((np.shape(grads["b" + str(l+1)])))
Code 2:
v["dW" + str(l+1)] = np.zeros((parameters["W" + str(l+1)].shape[0], parameters["W" + str(l+1)].shape[1]))
v["db" + str(l+1)] = np.zeros((parameters["b" + str(l+1)].shape[0], parameters["b" + str(l+1)].shape[1]))
Code 2 worked while Code 1 did not, even though both passed the provided test case and gave the same results there.
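In case it helps, here is roughly the context in which both snippets run, reconstructed from memory (so the function name and loop details may not match the assignment exactly), with the Code 2 lines shown inside it:

import numpy as np

def initialize_velocity(parameters):
    # parameters is assumed to hold "W1", "b1", ..., "WL", "bL"
    L = len(parameters) // 2
    v = {}
    for l in range(L):
        # Code 2 variant: shapes are read from parameters, which is the
        # dictionary actually passed into this function
        v["dW" + str(l+1)] = np.zeros(parameters["W" + str(l+1)].shape)
        v["db" + str(l+1)] = np.zeros(parameters["b" + str(l+1)].shape)
    return v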
Thanks.