C5W2A2 Exercise 2 - model. Aren't we updating the parameters incorrectly?

hi,

Going through the assignment, the model function updates the parameters W and b inside the inner loop, but at the same time it keeps accumulating the changes in dW and db across examples!

    for t in range(num_iterations): # Loop over the number of iterations
        
        cost = 0
        dW = 0
        db = 0
        
        for i in range(m):          # Loop over the training examples
            ...
            ...
            ...

            # Compute gradients 
            dz = a - Y_oh[i]
            dW += np.dot(dz.reshape(n_y,1), avg.reshape(1, n_h))
            db += dz

            # Update parameters with Stochastic Gradient Descent
            W = W - learning_rate * dW
            b = b - learning_rate * db

It feels like either we should

  • leave the updates where they are but use dW = and db = instead of the += that accumulates the changes (effectively making it SGD), or;
  • keep accumulating the changes (dW += and db += as they are) and move the updates of W and b to the outer loop (which would then be batch gradient descent).
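To make the two options concrete, here is a minimal sketch of both variants side by side. The shapes, the toy data, and the helper names (`train`, `softmax`, `X_avg`) are my own stand-ins, not the assignment's actual code; only the gradient formulas (`dz = a - Y_oh[i]`, the outer product with the averaged input) follow the snippet above.

```python
import numpy as np

np.random.seed(0)

# Hypothetical toy shapes standing in for the assignment's data
m, n_h, n_y = 20, 5, 3
X_avg = np.random.randn(m, n_h)                    # stand-in for the averaged word vectors
Y_oh = np.eye(n_y)[np.random.randint(0, n_y, m)]   # one-hot labels

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def train(variant, num_iterations=100, learning_rate=0.01):
    """variant='sgd': update W, b after every example (no accumulation).
    variant='batch': accumulate dW, db over all examples, update once per pass."""
    W = np.zeros((n_y, n_h))
    b = np.zeros(n_y)
    for t in range(num_iterations):
        dW = np.zeros_like(W)
        db = np.zeros_like(b)
        for i in range(m):
            avg = X_avg[i]
            a = softmax(np.dot(W, avg) + b)
            dz = a - Y_oh[i]                       # gradient of the loss w.r.t. the logits
            if variant == 'sgd':
                # Option 1: use this example's gradient alone, update immediately
                W = W - learning_rate * np.outer(dz, avg)
                b = b - learning_rate * dz
            else:
                # Option 2: accumulate gradients over the whole pass
                dW += np.outer(dz, avg)
                db += dz
        if variant == 'batch':
            # Option 2 (continued): single update per pass over the data
            W = W - learning_rate * dW
            b = b - learning_rate * db
    return W, b
```

The code as given in the notebook does neither: it accumulates like option 2 but updates inside the inner loop like option 1, so each step applies the sum of all gradients seen so far in the pass.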

Could it be that this is wrong?

Thanks!

Hi, Julian.

This is a really good point that I'd never noticed until you mentioned it. They say in the comments that the intent is to do Stochastic Gradient Descent, meaning that we would update the parameters after computing the gradients on each training sample individually. But then you're right that we should not be accumulating the dW and db gradient values in that case: each example's gradient should be used on its own.

But then it’s surprising that the convergence still works as well as it does. Let me try it without the += and see what happens.

Thanks for pointing this out!

Interesting. I tried the experiment of doing what I think we agree is the correct SGD solution, and the convergence is worse than with the code they gave us. I can pass the tests in the notebook and the grader with either version of the code. But the given code, with the += accumulated gradients, reaches 100% training accuracy before 400 iterations. With the non-accumulated pure SGD, running the training for 600 iterations, it plateaus at just over 97% accuracy from 400 iterations onward.

More thought required here to figure out if we’re just missing something in their intent here.

Hi Paul, thanks for your answers and for testing that. While that is true, I actually see that the accuracy on the test set went from 89% to 91% (at least with 1K iterations), so maybe better generalization? EDIT: actually I see that at 400 iterations the test accuracy was already 91% for the += implementation hehe