I found this comment:

I read it and understand that the definition of the cost function was changed to better describe the cost per epoch.

However, I still have a question.

If the cost function is defined as the sum of the per-sample losses (rather than their mean), then its gradient is the minibatch size times the backpropagated gradient of the original, averaged loss function. In other words, the parameter updates become much larger than before. Is this okay? Is there no problem because the learning rate is small?
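To make the concern concrete, here is a minimal sketch (names like `w`, `xs`, `ys` are made up for illustration) using a one-parameter squared-error model: the gradient of the summed cost is exactly the minibatch size times the gradient of the averaged cost, so with the same learning rate the update step is that many times larger.

```python
# Toy model: prediction w*x, per-sample loss (w*x - y)^2.
# Compare the gradient of the summed cost vs. the averaged cost.

def per_sample_grad(w, x, y):
    # d/dw (w*x - y)^2 = 2*(w*x - y)*x
    return 2 * (w * x - y) * x

xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical minibatch inputs
ys = [2.0, 4.0, 6.0, 8.0]   # hypothetical targets
w = 0.5
n = len(xs)

# Gradient of cost = sum of losses
grad_sum = sum(per_sample_grad(w, x, y) for x, y in zip(xs, ys))
# Gradient of cost = mean of losses
grad_mean = grad_sum / n

print(grad_sum, grad_mean)
print(grad_sum == n * grad_mean)
```

In practice the two definitions are equivalent up to rescaling the learning rate by the minibatch size, which may be what resolves the concern.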