DLS Course 2, Week 2, Programming Assignment (Exercise 3 and Exercise 5)

Hi, I need to confirm a notion because it seems a little contradictory. It is about the oscillation of the gradient descent algorithm when searching for the minimum. From my optimization courses, I understand that successive gradient steps can be normal to one another, causing changes in direction and making gradient descent's progress toward the optimum less efficient.

But the programming exercise introduction argues (as stated in the assignment text): "In Stochastic Gradient Descent, you use only 1 training example before updating the gradients. When the training set is large, SGD can be faster. But the parameters will 'oscillate' toward the minimum rather than converge smoothly", followed by diagrams where SGD oscillates heavily and GD does not.

So my question is: is the oscillation the result of taking steps toward the minimum with small batches of training examples, or does using SGD (small batches) simply make the oscillation worse and noisier? If it were the former, there would be no need to use Momentum or Adam optimizers with batch GD. But the assignment also says: "Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent." This statement, which matches what I expected, says that these optimizers take past gradients into consideration, so even with batch GD we still want to account for past gradients to get a smoother descent. In other words, GD by itself has oscillations, and stochastic GD just makes them worse.
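For concreteness, the momentum update I am referring to (as I understand it from the assignment, with beta the momentum coefficient and alpha the learning rate) is:

v_dW = beta * v_dW + (1 - beta) * dW
W = W - alpha * v_dW

Since v_dW is an exponentially weighted average of past gradients, the rule is the same whatever batch size was used to compute dW.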

Question 2 (Programming Assignment, Exercise 3 and Exercise 5): We are asked to write the code for initializing the Momentum and Adam parameters. I found a bug in my code, but I was hoping you could explain why one version worked and the other did not: both passed the test cases, but when I used the first one in the model to train my NN, I kept getting a KeyError…
Code 1 (taking the Momentum parameter initialization as a case study):

v["dW" + str(l+1)] = np.zeros((np.shape(grads["W" + str(l+1)])))

v["db" + str(l+1)] = np.zeros((np.shape(grads["b" + str(l+1)])))

Code 2:
v["dW" + str(l+1)] = np.zeros((parameters["W" + str(l+1)].shape[0], parameters["W" + str(l+1)].shape[1]))
v["db" + str(l+1)] = np.zeros((parameters["b" + str(l+1)].shape[0], parameters["b" + str(l+1)].shape[1]))

Code 2 worked while Code 1 didn't, even though both were tested with the test case and gave the same results there.
Thanks.

The problem with the first code is that it reads from grads, which the initialization function does not receive as an argument: no gradients exist yet at initialization time, which is exactly why the velocities should be sized from parameters instead. Code 1 presumably only passed the unit test because a grads variable with compatible keys happened to exist in the notebook's global scope at that point. When the function is called from model() during training, the gradients dictionary has keys "dW1", "db1", and so on, so the lookup grads["W" + str(l+1)] raises a KeyError. (The extra parentheses around np.shape() are harmless, by the way: in Python, ((a, b)) is the same tuple as (a, b).) I have not read all the beginning part of your post yet.
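For reference, a minimal sketch of the Momentum initialization sized from parameters (treat the exact signature as illustrative, not as the official solution):

import numpy as np

def initialize_velocity(parameters):
    L = len(parameters) // 2    # one W/b pair per layer
    v = {}
    for l in range(L):
        # Each velocity must match the shape of the corresponding gradient,
        # which in turn matches the shape of the parameter itself
        v["dW" + str(l + 1)] = np.zeros(parameters["W" + str(l + 1)].shape)
        v["db" + str(l + 1)] = np.zeros(parameters["b" + str(l + 1)].shape)
    return v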

For the questions about gradient descent, the point is that the gradients are averaged over all the samples in whatever batch size you are using. So the smaller the batch size (1 in the limiting case of SGD), the more statistical noise there is in the process. Momentum methods mitigate that noise with whatever batch size you are using. The general point is that there is never any guarantee of smooth, monotonic convergence, even with full-batch GD.
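To make the noise point concrete, here is a toy sketch of my own (not from the assignment): the same noisy gradient stream drives a plain stochastic step and a momentum step on f(w) = w**2, so you can compare how much each iterate jitters around the minimum at w = 0.

import numpy as np

# Noisy gradients of f(w) = w**2; the noise stands in for the
# sampling noise of a small batch.
rng = np.random.default_rng(0)
alpha, beta = 0.1, 0.9                # learning rate, momentum coefficient
w_plain, w_mom, v = 5.0, 5.0, 0.0
for t in range(200):
    noise = rng.normal(scale=2.0)
    w_plain -= alpha * (2 * w_plain + noise)          # plain noisy GD step
    v = beta * v + (1 - beta) * (2 * w_mom + noise)   # running average of gradients
    w_mom -= alpha * v                                # momentum step
print(f"plain: w = {w_plain:+.3f}   momentum: w = {w_mom:+.3f}")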

Oh, thanks Paulin. So if I had sized the velocities from parameters rather than from grads, I shouldn't have had the issue. I will try that and get back to you.

Oh thanks for the clarification. I think I understand it better now.