Mini-batch gradient descent decreasing

henrikh · September 2, 2021, 9:41am

Hi,

I don’t understand why the cost in the second video of the DLS course 2 should decrease even with oscillations.

The figure shows the cost function plotted against the number of mini-batches. I think the curve will decrease if we plot, for example, the average cost over the mini-batches as a function of the number of iterations.

I am not sure if I am right though.

Henrikh

starboy · September 2, 2021, 11:50am

Let’s take these 4 examples:

Graph 1 (3.1):
1. The slope of the tangent line is positive at all points.
2. Rate of change (the second derivative): the slope is increasing.

Graph 2:
1. The slope of the tangent line is positive at all points.
2. Rate of change: the slope of the tangent line is decreasing.

Graph 3:
1. The slope of the tangent line is negative at all points.
2. Rate of change: the slope is increasingly decreasing.

Graph 4 (3.4):
1. The slope of the tangent line is negative at all points.
2. Rate of change: here the slope is always decreasing.

If you look closely, your graph is similar to the 3rd graph.
If this answers your question, let me know in a comment.
WORDS TO NOTE: increasingly decreasing
Thank you.

henrikh · September 2, 2021, 12:07pm

I understand that my oscillating curve looks similar to your 3rd graph. But how does it answer my question?

paulinpaloalto · September 2, 2021, 11:08pm

The oscillating curve you show is just an example: it’s one thing that can happen. There’s no guarantee that exactly that will happen or that that pattern is a particularly common one. You can have oscillations and divergence. You can have oscillations and convergence. You can have monotonic convergence and you can also have monotonic divergence. You can have a learning curve that monotonically decreases for a while and then oscillates and diverges. Literally anything can happen. It just depends. On everything: on your data, on your hyperparameters (number of layers, number of neurons, minibatch size, learning rate, number of iterations, …).
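As a minimal sketch of how much a single hyperparameter can matter, the toy example below runs plain gradient descent on the one-dimensional cost J(w) = w² with two learning rates. The function name `cost_curve` and the specific values are assumptions chosen for illustration; they are not from the course. With a small learning rate the cost falls at every step, and with a learning rate that is too large the cost grows at every step.

```python
def cost_curve(learning_rate, steps=20, w0=1.0):
    """Run plain gradient descent on J(w) = w**2 and return the cost after every step."""
    w = w0
    costs = [w ** 2]
    for _ in range(steps):
        grad = 2.0 * w                    # dJ/dw for J(w) = w**2
        w = w - learning_rate * grad      # one gradient descent update
        costs.append(w ** 2)
    return costs


# Each update multiplies w by (1 - 2 * learning_rate).
converging = cost_curve(learning_rate=0.1)   # factor 0.8: cost shrinks every step
diverging = cost_curve(learning_rate=1.1)    # factor -1.2: cost grows every step

print(f"lr=0.1: first cost {converging[0]:.3f}, last cost {converging[-1]:.6f}")
print(f"lr=1.1: first cost {diverging[0]:.3f}, last cost {diverging[-1]:.1f}")
```

A real network has a far more complicated loss surface, so this only illustrates the monotonic convergence and monotonic divergence cases, not the oscillating ones.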

Here’s a relevant paper from Yann LeCun’s group titled The Loss Surfaces of Multilayer Networks.

So if the question is “how can we tell if we are actually making progress or should we stop and try a different set of hyperparameters?” then something like your idea of using some type of moving average, such as an exponentially weighted average (EWA) of the most recent cost values, might be useful. The other key point to make here is that the cost by itself is not really a very meaningful metric. We only use the cost curves for exactly that sort of qualitative judgement: do I have a convergence problem or not, and is it a waste of time to do more iterations? The real metric that matters is prediction accuracy on both the training and test sets. You can sample those every 50 or 100 iterations to get another view of whether the training is working well or not.
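A minimal sketch of the smoothing idea, assuming a bias-corrected exponentially weighted average with β = 0.9. The helper name `exponentially_weighted_average` and the synthetic cost values are made up for illustration; in practice `raw_costs` would be the per-mini-batch costs recorded during training.

```python
import math
import random


def exponentially_weighted_average(values, beta=0.9):
    """Return the bias-corrected exponentially weighted average of a sequence."""
    smoothed = []
    v = 0.0
    for t, x in enumerate(values, start=1):
        v = beta * v + (1.0 - beta) * x
        smoothed.append(v / (1.0 - beta ** t))   # bias correction for early steps
    return smoothed


random.seed(0)
# Synthetic stand-in for noisy mini-batch costs: a decaying trend plus Gaussian noise.
raw_costs = [math.exp(-t / 100.0) + random.gauss(0.0, 0.1) for t in range(300)]
smooth_costs = exponentially_weighted_average(raw_costs, beta=0.9)

for t in (0, 100, 200, 299):
    print(f"step {t:3d}: raw {raw_costs[t]:.3f}, smoothed {smooth_costs[t]:.3f}")
```

A larger β averages over more past values, which gives a smoother curve that reacts more slowly to real changes in the trend.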

One other important thing to note is that a lower cost does not necessarily indicate better accuracy, because the prediction accuracy for any given sample is “quantized”. The prediction value for a given true sample may go from 0.65 to 0.68 with the next 100 iterations and that will give a lower cost, but not a greater accuracy.
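The 0.65 to 0.68 example can be checked directly. The sketch below assumes binary cross-entropy as the per-sample cost and a decision threshold of 0.5; both are assumptions for illustration, and the helper names are hypothetical.

```python
import math


def binary_cross_entropy(y_true, y_prob):
    """Per-sample cross-entropy loss for a binary label and a predicted probability."""
    return -(y_true * math.log(y_prob) + (1 - y_true) * math.log(1 - y_prob))


def predicted_label(y_prob, threshold=0.5):
    """Quantize a probability into a hard 0/1 prediction."""
    return 1 if y_prob > threshold else 0


y_true = 1
loss_before = binary_cross_entropy(y_true, 0.65)
loss_after = binary_cross_entropy(y_true, 0.68)

print(f"loss before: {loss_before:.4f}, prediction: {predicted_label(0.65)}")
print(f"loss after:  {loss_after:.4f}, prediction: {predicted_label(0.68)}")
```

The loss goes down, but both probabilities are already above the threshold, so the hard prediction and therefore the accuracy do not change.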

thearkamitra · September 3, 2021, 2:45am

Hey @paulinpaloalto,

One thing I want to point out is that, although prediction accuracy is the metric used for most applications, for datasets with an unequal distribution of classes, other metrics like AUROC or F1 might be more helpful!
But yeah, the cost ensures that the model pipeline works!
Thanks for Yann LeCun’s paper!
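A small sketch of why accuracy can mislead on imbalanced data, using a made-up dataset with 95 negatives and 5 positives and a hypothetical model that always predicts the negative class. The helper functions are written out by hand for illustration rather than taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; defined as 0.0 when there are no true positives."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [0] * 95 + [1] * 5       # made-up imbalanced labels
y_pred = [0] * 100                # always predicts the majority class

print(f"accuracy: {accuracy(y_true, y_pred):.2f}")
print(f"F1 (positive class): {f1_score(y_true, y_pred):.2f}")
```

The accuracy looks high even though the model never finds a single positive example, which is exactly what F1 on the positive class exposes.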

henrikh · September 3, 2021, 9:04am

Thanks for the very detailed discussion! I understand better now that the cost is not really the metric that matters; the prediction accuracy is. But why do we need to plot the cost as a function of the number of mini-batches?

Henrikh

nramon · September 3, 2021, 11:07am

Hi, @henrikh.

It’s not the number of mini-batches, but rather the mini-batch number over time: first mini-batch, second mini-batch, etc.

I think it’s written like that to emphasize the fact that it’s taking one gradient descent step per mini-batch. This may be helpful too.
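A minimal sketch of "one gradient descent step per mini-batch", assuming a toy one-parameter linear regression on synthetic data. All names and values here are illustrative and not from the course. The list `costs` gets one entry per mini-batch, so its index is the mini-batch number over time; averaging those entries per epoch gives the kind of smoother curve discussed earlier in the thread.

```python
import random

random.seed(0)
# Synthetic data: y is roughly 3 * x plus noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(256)]
data = [(x, 3.0 * x + random.gauss(0.0, 0.3)) for x in xs]


def batch_cost_and_grad(w, batch):
    """Mean squared error cost (with the usual 1/2 factor) and its gradient on one mini-batch."""
    n = len(batch)
    cost = sum((w * x - y) ** 2 for x, y in batch) / (2 * n)
    grad = sum((w * x - y) * x for x, y in batch) / n
    return cost, grad


w = 0.0
learning_rate = 0.1
batch_size = 32
num_epochs = 20
costs = []                                   # costs[t] is the cost on mini-batch number t

for epoch in range(num_epochs):
    random.shuffle(data)
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        cost, grad = batch_cost_and_grad(w, batch)
        w -= learning_rate * grad            # one update per mini-batch
        costs.append(cost)

batches_per_epoch = len(data) // batch_size
epoch_means = [
    sum(costs[i:i + batches_per_epoch]) / batches_per_epoch
    for i in range(0, len(costs), batches_per_epoch)
]
print(f"first epoch mean cost: {epoch_means[0]:.3f}")
print(f"last epoch mean cost:  {epoch_means[-1]:.3f}")
```

Because every mini-batch contains different examples, consecutive entries of `costs` typically go up as well as down, while the overall trend is downward.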

Hope you’re enjoying the course

henrikh · September 3, 2021, 11:35am

Thanks! It is clear to me now and I really enjoy the course!
