A doubt about the Week 3 Assignment

Dear Administrator,

Could you please guide me on the following issue?

I found that in “Section 3.3 - Train the Model” of “Week 3 Assignment - Tensorflow_introduction”, the total loss is used to compute the gradients instead of the mean loss that I learned about in the previous lectures. May I know whether this approach from the assignment is correct?

*I am using pseudocode instead, to avoid breaching the code-sharing rules

For each (minibatch_X, minibatch_Y) in minibatches:

   With tf.GradientTape() as tape:

        Z3 <- call the forward propagation function

        minibatch_total_loss <- compute the **total loss**

   Define trainable_variables as W1, b1, W2, b2, W3, b3

   grads <- call tape.gradient, passing minibatch_total_loss
            and trainable_variables as arguments

   Update the parameters using the optimizer

Thank you.

Calculating the gradients based on the total loss is incorrect. As you observed, the mean is correct.

Please wait while I get more information from other mentors / staff regarding this.

The reason that compute_total_loss uses the sum instead of the mean is that it is applied to the individual minibatches. We want the overall average cost for the whole epoch, but we can’t get that by taking the average at the minibatch level: the math doesn’t work out unless all the minibatches are the same size, which might not be true, right? So we keep a running sum over the minibatches and only compute the mean at the end of the epoch.

        # Divide the accumulated epoch loss by the total number of samples
        epoch_total_loss /= m
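
To make that bookkeeping concrete, here is a minimal sketch of the pattern (my own illustration, not the assignment code), assuming a hypothetical forward function, a list of (minibatch_X, minibatch_Y) pairs, and the total sample count m. Each minibatch contributes a sum of per-example losses, and the division by m happens only once, at the end of the epoch:

    import tensorflow as tf

    def run_epoch(minibatches, forward, trainable_variables, optimizer, m):
        epoch_total_loss = 0.0
        for minibatch_X, minibatch_Y in minibatches:
            with tf.GradientTape() as tape:
                Z3 = forward(minibatch_X)
                # sum (not mean) of the per-example losses for this minibatch
                minibatch_total_loss = tf.reduce_sum(
                    tf.keras.losses.categorical_crossentropy(
                        minibatch_Y, Z3, from_logits=True))
            grads = tape.gradient(minibatch_total_loss, trainable_variables)
            optimizer.apply_gradients(zip(grads, trainable_variables))
            # keep a running sum of the loss across minibatches
            epoch_total_loss += minibatch_total_loss
        # one division at the end gives the true mean cost for the epoch,
        # even if the minibatches have different sizes
        return epoch_total_loss / m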

But you’re right that the code computes the gradients relative to the sum at the minibatch level. That just means the gradients are scaled up by the scalar factor m_b, where m_b is the minibatch size. They are still vectors that point in the same direction, so we might need to tweak the learning rate to be a bit lower, but the result is the same. If you minimize J, you’ve also minimized \frac{1}{m} J, right? In other words, the solution we end up finding by Gradient Descent should be the same.
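
As a quick sanity check of that scaling argument (a toy example, unrelated to the assignment), you can compute the gradients of a summed loss and of a mean loss over the same minibatch and verify that they differ only by the factor m_b:

    import tensorflow as tf

    m_b = 4                                        # minibatch size
    x = tf.constant([[1.0], [2.0], [3.0], [4.0]])
    y = tf.constant([[2.0], [4.0], [6.0], [8.0]])
    w = tf.Variable([[0.5]])

    with tf.GradientTape(persistent=True) as tape:
        err = tf.square(x @ w - y)                 # per-example squared error
        sum_loss = tf.reduce_sum(err)
        mean_loss = tf.reduce_mean(err)

    grad_sum = tape.gradient(sum_loss, w)
    grad_mean = tape.gradient(mean_loss, w)

    # same direction, just rescaled: grad_sum == m_b * grad_mean
    print(grad_sum.numpy(), grad_mean.numpy())
    print(bool(tf.reduce_all(tf.abs(grad_sum - m_b * grad_mean) < 1e-6)))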