The definition of the cost J is that it is a scalar value: the average of the individual loss values over all the samples in the current full pass through the dataset. But the point of this particular function we are building here is that it computes the sum of the losses, not the average. The reason is that this is the lower-level function that we apply to each minibatch. Then we add all those per-minibatch cost values up and only divide by the number of samples when we get to the end of the full epoch (all the minibatches).
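Here's a minimal sketch of what that lower-level function can look like. The function name, the choice of categorical crossentropy, and the tensor layout are all just assumptions for illustration; the key point is `tf.reduce_sum` with no division:

```python
import tensorflow as tf

def compute_total_loss(logits, labels):
    # Per-sample losses for one minibatch; categorical crossentropy
    # is an assumed example here, not necessarily the assignment's loss.
    per_sample_losses = tf.keras.losses.categorical_crossentropy(
        labels, logits, from_logits=True
    )
    # Sum over the minibatch -- deliberately NOT the mean, and no
    # division by the minibatch size at this level.
    return tf.reduce_sum(per_sample_losses)
```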
The reason for doing it this way is that the math doesn't work if you take the average at the minibatch level. It would if all the minibatches were the same size, but that is not guaranteed, right? That condition fails whenever the minibatch size does not evenly divide the full training set size, because then the last minibatch is smaller. If you averaged within each minibatch and then combined those averages, the samples in the smaller final minibatch would count more heavily than the rest, so you would not get the true per-sample average. That is the reason that they structured the cost computation this way. If you missed this before, have another careful look at how this was done in the C2 W2 Optimization Assignment (the immediately previous assignment), which is where we really implemented minibatch gradient descent for the first time. What we are doing here mirrors how it was done there, but we're using TF this time around.
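A quick way to convince yourself, with made-up numbers (pure illustration, nothing from the assignment):

```python
# Two minibatches of unequal size: 64 and 36 samples (100 total).
losses_a = [2.0] * 64   # pretend every sample in batch A has loss 2.0
losses_b = [5.0] * 36   # and every sample in batch B has loss 5.0

# Sum everything, divide once by the total sample count: 3.08
true_average = (sum(losses_a) + sum(losses_b)) / 100

# Average within each minibatch, then average the averages: 3.5
average_of_averages = (sum(losses_a) / 64 + sum(losses_b) / 36) / 2

print(true_average, average_of_averages)  # they disagree
```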
So in other words, when I said to use `reduce_sum` instead of `reduce_mean`, I didn't mean to then divide by the number of samples at this level. That happens in the higher-level loop over the minibatches.
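And here's a minimal runnable sketch of that higher-level loop, using a toy linear model and made-up sizes (100 samples with minibatch size 64, so the two minibatches are 64 and 36). All the names, shapes, and hyperparameters are assumptions for illustration, not the assignment's actual code:

```python
import tensorflow as tf

m = 100  # total number of training samples (assumed)
X = tf.random.normal((m, 4))
Y = tf.one_hot(tf.random.uniform((m,), maxval=3, dtype=tf.int32), depth=3)
dataset = tf.data.Dataset.from_tensor_slices((X, Y)).batch(64)  # 64 + 36

W = tf.Variable(tf.random.normal((4, 3)))
b = tf.Variable(tf.zeros((3,)))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

for epoch in range(5):
    epoch_total_loss = 0.0
    for minibatch_X, minibatch_Y in dataset:
        with tf.GradientTape() as tape:
            logits = minibatch_X @ W + b
            # Sum (not mean) of the losses over this minibatch
            per_sample = tf.keras.losses.categorical_crossentropy(
                minibatch_Y, logits, from_logits=True)
            minibatch_loss = tf.reduce_sum(per_sample)
        grads = tape.gradient(minibatch_loss, [W, b])
        optimizer.apply_gradients(zip(grads, [W, b]))
        # Accumulate the raw sums across minibatches
        epoch_total_loss += minibatch_loss
    # Divide by m only once, at the end of the epoch. This gives the
    # true per-sample average even though 64 does not evenly divide 100.
    epoch_cost = epoch_total_loss / m
    print(f"Epoch {epoch}: cost = {float(epoch_cost):.4f}")
```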