Mini-batch gradient descent decreasing


I don’t understand why the cost in the second video of the DLS course 2 should decrease even with oscillations.

The figure shows the cost as a function of the mini-batch number. I think the curve would decrease more smoothly if we plotted, for example, the average of the mini-batch costs over each epoch as a function of the number of iterations.

I am not sure if I am right though.
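Here's a quick numeric sketch of that averaging idea. The per-mini-batch costs are entirely made up (a downward trend plus uniform noise, not real training output), but they show why the raw per-batch curve oscillates while the per-epoch averages come out smooth and decreasing:

```python
import random

random.seed(0)

num_epochs = 5
batches_per_epoch = 100

# Made-up per-mini-batch costs: an overall downward trend plus
# oscillation from mini-batch sampling noise.
per_batch_costs = []
for epoch in range(num_epochs):
    for b in range(batches_per_epoch):
        t = epoch * batches_per_epoch + b
        trend = 1.0 / (1.0 + 0.01 * t)       # overall downward trend
        noise = random.uniform(-0.1, 0.1)    # per-batch oscillation
        per_batch_costs.append(trend + noise)

# Averaging the costs over each epoch smooths out the oscillations.
epoch_avgs = [
    sum(per_batch_costs[e * batches_per_epoch:(e + 1) * batches_per_epoch])
    / batches_per_epoch
    for e in range(num_epochs)
]
print(epoch_avgs)  # decreases monotonically, unlike the raw per-batch curve
```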


let’s take these 4 examples:

Graph 1:

  1. slope of the tangent line is positive at all points
  2. rate of change: the slope (second derivative) is increasing

Graph 2:

  1. slope of the tangent line is positive at all points
  2. rate of change: the slope of the tangent line is decreasing

Graph 3:

  1. slope of the tangent line is negative at all points
  2. rate of change: the slope is increasingly decreasing

Graph 4:

  1. slope of the tangent line is negative at all points
  2. rate of change: the slope is always decreasing

If you look closely, your graph is similar to the 3rd graph: the slope is negative overall, but "increasingly decreasing", which is why the cost keeps falling even while oscillating.
If this answers your question, let me know in a comment.
Words to note: increasingly decreasing.
Thank you.

I understand that my oscillating curve looks similar to your 3rd graph. But how does it answer my question?

The oscillating curve you show is just an example: it’s one thing that can happen. There’s no guarantee that exactly that will happen or that this pattern is particularly common. You can have oscillations and divergence. You can have oscillations and convergence. You can have monotonic convergence, and you can also have monotonic divergence. You can have a learning curve that monotonically decreases for a while and then oscillates and diverges. Literally anything can happen. It depends on everything: on your data and on your hyperparameters (number of layers, number of neurons, mini-batch size, learning rate, number of iterations, …).

Here’s a relevant paper from Yann LeCun’s group titled The Loss Surfaces of Multilayer Networks.

So if the question is “how can we tell whether we are actually making progress, or whether we should stop and try a different set of hyperparameters?”, then something like your idea of using a moving average, such as an EWA over the recent cost values, might be useful.

The other key point here is that the cost by itself is not really a very meaningful metric. We only use the cost curves for exactly that sort of qualitative judgement: do I have a convergence problem, and is it a waste of time to run more iterations? The real metric that matters is prediction accuracy on both the training and test sets. You can sample those every 50 or 100 iterations to get another view of whether the training is working well.
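That EWA idea can be sketched in a few lines. This is a minimal illustration, assuming β = 0.9 and a made-up oscillating cost series (not real training output); it uses the bias correction from the course so the early smoothed values aren't dragged toward zero:

```python
def ewa(costs, beta=0.9):
    """Exponentially weighted average of a cost series, with bias correction."""
    smoothed = []
    v = 0.0
    for t, c in enumerate(costs, start=1):
        v = beta * v + (1 - beta) * c
        smoothed.append(v / (1 - beta ** t))  # bias correction for early steps
    return smoothed

# Made-up costs that oscillate but trend downward overall.
costs = [1.0, 0.7, 0.9, 0.6, 0.8, 0.5, 0.7, 0.4, 0.6, 0.3]
smoothed = ewa(costs)
print(smoothed)  # much smoother than the raw series, and clearly trending down
```

Plotting `smoothed` instead of `costs` makes the qualitative judgement (converging or not?) much easier to read off the curve.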

One other important thing to note is that a lower cost does not necessarily indicate better accuracy, because the prediction accuracy for any given sample is “quantized”. The predicted value for a given positive sample may go from 0.65 to 0.68 over the next 100 iterations, and that will give a lower cost, but not greater accuracy.
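Here's a toy illustration of that point with the 0.65 → 0.68 numbers from above (made-up values, binary cross-entropy, the usual 0.5 decision threshold): the cost drops, but the thresholded prediction, and therefore the accuracy, is unchanged.

```python
import math

def cross_entropy(y, p):
    """Binary cross-entropy loss for one sample with true label y and prediction p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def predict(p, threshold=0.5):
    """Quantize a probability into a 0/1 class prediction."""
    return 1 if p >= threshold else 0

y = 1  # true label of the sample
cost_before = cross_entropy(y, 0.65)
cost_after = cross_entropy(y, 0.68)

print(cost_before, cost_after)        # cost goes down ...
print(predict(0.65), predict(0.68))   # ... but the prediction is still 1 either way
```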


Hey @paulinpaloalto,

One thing I want to point out is that although prediction accuracy is the metric used in most applications, for datasets with an imbalanced class distribution, other metrics like AUROC or F1 might be more helpful!
But yeah, watching the cost is a good sanity check that the model pipeline works!
Thanks for Yann LeCun’s paper!
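To make that concrete, here's a small made-up example (95 negatives, 5 positives, and a deliberately useless model): accuracy looks great while F1 exposes the problem. The F1 computation is written out by hand here rather than using a library:

```python
# Made-up imbalanced dataset: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a useless model that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(accuracy)  # 0.95 -- looks great
print(f1)        # 0.0  -- reveals the model never finds the positive class
```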


Thanks for the very detailed discussion! I now understand better that the cost is not really the metric that matters; prediction accuracy is. But why do we plot the cost as a function of the number of mini-batches?


Hi, @henrikh.

It’s not the number of mini-batches, but rather the mini-batch number over time: first mini-batch, second mini-batch, etc.

I think it’s written like that to emphasize the fact that it’s taking one gradient descent step per mini-batch. This may be helpful too.
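That "one step per mini-batch" loop can be sketched end to end. This is a toy example, not the course's code: 1-D linear regression with made-up data, squared-error cost, and one recorded cost value per mini-batch, which is exactly what the lecture's x-axis counts:

```python
import random

random.seed(1)

# Toy data: y = 2x plus a little noise.
data = [(x / 100, 2 * x / 100 + random.uniform(-0.01, 0.01)) for x in range(200)]
random.shuffle(data)

w, lr, batch_size = 0.0, 0.5, 20
costs = []  # one cost value per mini-batch, as on the lecture's x-axis

for epoch in range(3):
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # Cost and gradient computed from this mini-batch only.
        cost = sum((w * x - y) ** 2 for x, y in batch) / (2 * len(batch))
        grad = sum((w * x - y) * x for x, y in batch) / len(batch)
        w -= lr * grad      # one gradient descent step per mini-batch
        costs.append(cost)  # so the plot gets one point per mini-batch

print(len(costs))  # 3 epochs x 10 mini-batches = 30 points
print(w)           # should approach the true slope of 2
```

Each point on the curve comes from a different mini-batch, which is precisely why it oscillates even when training is going well.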

Hope you’re enjoying the course :slight_smile:


Thanks! It is clear to me now and I really enjoy the course!
