What they say is correct if you read it carefully. We divide by the total number of samples only once, at the end of one full pass of training (all the minibatches). But the function we are writing here computes the cost for a single minibatch, so it only takes the sum. The higher level logic keeps a running sum across all the minibatches and then computes the average when it is finished with the pass. You can’t compute the average at the minibatch level, because the math doesn’t work unless all the minibatches are the same size. They won’t be if the minibatch size does not evenly divide the total number of training samples, because the last minibatch comes out smaller. In that case the average of the per-minibatch averages is not the overall average.
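
To make the arithmetic concrete, here is a small sketch in plain Python. The per-sample cost values and the minibatch size of 2 are made up for illustration, and none of this is the assignment's actual code. It just compares "sum per minibatch, divide once at the end" against "average of averages" when the last minibatch is smaller than the rest.

```python
# Hypothetical per-sample costs for 5 training samples (made-up numbers).
per_sample_costs = [1.0, 2.0, 3.0, 4.0, 10.0]
batch_size = 2  # does not evenly divide 5, so the last minibatch has 1 sample

# Split into minibatches: [[1.0, 2.0], [3.0, 4.0], [10.0]]
minibatches = [
    per_sample_costs[i:i + batch_size]
    for i in range(0, len(per_sample_costs), batch_size)
]

m = len(per_sample_costs)  # total number of samples

# Approach described above: each minibatch contributes only its SUM,
# and we divide by m once, after the full pass.
running_sum = 0.0
for mb in minibatches:
    running_sum += sum(mb)
epoch_cost = running_sum / m

# Naive alternative: average inside each minibatch, then average those.
avg_of_avgs = sum(sum(mb) / len(mb) for mb in minibatches) / len(minibatches)

true_mean = sum(per_sample_costs) / m

print(epoch_cost, avg_of_avgs, true_mean)
```

With these numbers the running-sum approach matches the true mean over all samples, while the average of averages does not, because the single sample in the last minibatch gets as much weight as two samples in each of the others.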

If you were paying close attention, this is exactly how it worked when we first implemented minibatch gradient descent in the previous assignment (C2 W2 A1 Optimization). It’s the same here, but now we’re doing it in TF instead of straight numpy.