Why take cost average in Gradient Descent?

In the Optimization Methods exercise, the implementations of GD and SGD take the cost average at the very end. I don’t believe we’ve done this before in other notebooks. Why does this notebook take the average?

The cost is normally defined as the average of the loss values across all the samples in the epoch. The compute_cost utility function they gave you here (you can find the source by clicking “File → Open” and then opening the appropriate python file) is written so that it returns the sum of the losses across the current batch of samples. With that design, they can use the same subroutine in all three cases: full Batch Gradient Descent, Mini-Batch GD, or Stochastic GD. In the latter two cases, you keep a running sum of the costs across all the minibatches and then divide by the total number of samples at the end.
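Here is a minimal sketch of that pattern. The function name `compute_cost_sum` and the toy data are my own for illustration (the course's actual `compute_cost` signature may differ); the point is just that the helper returns a sum, and the outer loop divides once by the total sample count:

```python
import numpy as np

def compute_cost_sum(AL, Y):
    """Hypothetical sketch: return the SUM of the cross-entropy losses
    over the samples in this (mini)batch, not the average."""
    losses = -(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))
    return np.sum(losses)

# Toy stand-in data: 10 samples, deliberately not a multiple of the batch size
rng = np.random.default_rng(0)
m = 10
Y = rng.integers(0, 2, size=(1, m)).astype(float)   # labels
AL = rng.uniform(0.05, 0.95, size=(1, m))           # fake predictions

batch_size = 3
cost_total = 0.0
for start in range(0, m, batch_size):
    sl = slice(start, start + batch_size)
    cost_total += compute_cost_sum(AL[:, sl], Y[:, sl])  # running sum

cost_avg = cost_total / m   # divide once, by the TOTAL sample count

# This matches the full-batch average exactly, even though the last
# minibatch only has one sample:
assert np.isclose(cost_avg, compute_cost_sum(AL, Y) / m)
```

Because the per-batch values are plain sums, the final division by the total sample count is exact no matter how the samples were split into minibatches.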


Note that there is a more complicated way to deal with this: you could have compute_cost compute the average over whatever inputs it is given, add those averages up in the outer loop, and then at the end divide by the number of minibatches. That gives the same result: the average over the full batch. There’s just one little problem: that math doesn’t work if the minibatch size does not evenly divide the total batch size.

So there are really two problems with that approach: even in the case where the minibatch size evenly divides the full batch, it’s more complicated and you have to think hard to convince yourself it works. And then it simply doesn’t work in all cases.

The way they did it is simple and clearly correct in all cases.
