However, I suspect that this is not the true cost function formula, because the return value here seems to be the sum of the losses over the training examples, not the average.
Is there a division by minibatch size built into this function somewhere?
In the cell defining the model() function in 3.3 - Train the Model, there is a line like this:
epoch_cost /= m
The “epoch_cost” calculated here is the sum of the costs of all the mini-batches. For example, if there are 1024 training examples and the mini-batch size is 64, this epoch_cost is the sum of 16 mini-batch costs, i.e. roughly 16 times the cost of a single mini-batch. In each mini-batch, the cost is normally calculated as “(the sum of the losses per example) / 64”. So the average cost per epoch should be “(the sum of the losses over all examples) / 1024”, i.e. “(the sum of the costs per mini-batch) / 16”.
But that is not the case with epoch_cost /= m.
I believe this ‘m’ is defined as the number of examples in the entire training set. So, in the example above, I’m wondering if the notebook is calculating “(the sum of the costs of the mini-batches) / 1024”. In that case, I think the computed cost will be smaller than expected, by a factor of the mini-batch size.
If the return value “cost” in the first question is the sum of the losses over the training examples, then this notebook happens to calculate the epoch_cost correctly. (That is because minibatch_cost would then be the sum of the losses of the examples in the mini-batch, not the cost as it is usually defined.)
However, this seems like an unusual way to set up the notebook, so I am not at all confident in my reasoning. Please let me know if I have misunderstood anything.
I read that and understood that the definition of the cost function was changed to better describe the cost per epoch.
However, I still have a question.
If the cost function is defined as the sum of the per-example losses, the gradient will be the mini-batch size times what backpropagation gives for the original (averaged) cost function. In other words, each parameter update becomes much larger than before. Is this okay? Is there no problem because the learning rate is small?
It’s a good question, but it all works out. If we redefine the cost per mini-batch to be the sum rather than the mean, we can still get the correct overall epoch cost by simply waiting to divide by m until we’ve completed the full epoch. The reason for doing it that way is that if you take the average at the mini-batch level, then neither the “average of the averages” nor the “sum of the averages” gives the full epoch average cost when the mini-batch size does not evenly divide the training set size, because the last, smaller mini-batch gets the same weight as the full-sized ones. If all the batches are the same size, then the “average of the averages” would work, but that solution is not general. Take a look at the way they handle the costs in the Optimization Assignment in C2 Week 2 to see this method of computing the average cost in action.
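To make the point about uneven mini-batches concrete, here is a minimal sketch in plain Python. The function names and the loss values are made up for illustration and are not taken from the notebook; the only assumption is that we have one loss value per training example.

```python
def epoch_cost_sum_then_divide(losses, batch_size):
    """Sum the per-example losses in each mini-batch, divide by m once at the end."""
    m = len(losses)
    epoch_cost = 0.0
    for start in range(0, m, batch_size):
        minibatch = losses[start:start + batch_size]
        epoch_cost += sum(minibatch)  # mini-batch cost kept as a sum
    return epoch_cost / m


def epoch_cost_average_of_averages(losses, batch_size):
    """Average within each mini-batch, then average those averages."""
    m = len(losses)
    batch_means = []
    for start in range(0, m, batch_size):
        minibatch = losses[start:start + batch_size]
        batch_means.append(sum(minibatch) / len(minibatch))
    return sum(batch_means) / len(batch_means)


# Hypothetical per-example losses: m = 100 with batch size 64,
# so the mini-batches have 64 and 36 examples.
losses = [0.0] * 64 + [1.0] * 36
true_mean = sum(losses) / len(losses)

sum_then_divide = epoch_cost_sum_then_divide(losses, 64)
avg_of_avgs = epoch_cost_average_of_averages(losses, 64)

print(true_mean, sum_then_divide, avg_of_avgs)
```

With these values the true mean and the "sum then divide by m" result agree, while the average of the averages comes out higher because the 36-example mini-batch counts as much as the 64-example one.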
But even if we changed the definition of the full cost function to be the sum instead of the average, it would make no difference to the parameter values that minimize it. If you minimize J you get the same values for the parameters that you do if you minimize m * J, where m is a positive constant, right? That may seem a little counterintuitive, since the gradients of m * J are m times the gradients of J, so each update step is larger. So, as you say, it may require adjusting the learning rate to make sure convergence still works: with larger gradients, there is a bigger chance of oscillation or divergence. But the fundamental point is that multiplying the cost by a positive constant doesn’t really change the shape of the cost surface or where the local minima are in the parameter space.
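Here is a small sketch of that argument on a toy problem. It assumes a one-parameter model with per-example loss (w - x)^2, which is not the notebook's model; the data, learning rates, and function names are chosen only for illustration.

```python
def grad_mean_cost(w, xs):
    """Gradient of J(w) = (1/m) * sum((w - x)^2)."""
    return sum(2.0 * (w - x) for x in xs) / len(xs)


def grad_sum_cost(w, xs):
    """Gradient of m * J(w) = sum((w - x)^2), i.e. m times grad_mean_cost."""
    return sum(2.0 * (w - x) for x in xs)


def gradient_descent(grad_fn, xs, lr, steps=50, w0=0.0):
    """Plain gradient descent on a single scalar parameter."""
    w = w0
    for _ in range(steps):
        w -= lr * grad_fn(w, xs)
    return w


xs = [1.0, 2.0, 3.0, 4.0]  # both costs are minimized at the mean, 2.5
m = len(xs)

w_mean = gradient_descent(grad_mean_cost, xs, lr=0.3)
w_sum_scaled_lr = gradient_descent(grad_sum_cost, xs, lr=0.3 / m)
w_sum_same_lr = gradient_descent(grad_sum_cost, xs, lr=0.3)

print(w_mean, w_sum_scaled_lr, w_sum_same_lr)
```

In this toy setting, minimizing the summed cost with the learning rate divided by m follows the same path as minimizing the mean cost, and both end up at 2.5. Keeping the original learning rate with the summed cost makes the steps m times larger, and here that is enough to make the iterates diverge.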