Mini-batch gradient descent decreasing

The oscillating curve you show is just one example of what can happen. There’s no guarantee that exactly that will happen, or that the pattern is a particularly common one. You can have oscillations and divergence. You can have oscillations and convergence. You can have monotonic convergence and you can also have monotonic divergence. You can have a learning curve that monotonically decreases for a while and then oscillates and diverges. Anything can happen. It depends on everything: on your data and on your hyperparameters (number of layers, number of neurons, minibatch size, learning rate, number of iterations, …).

Here’s a relevant paper from Yann LeCun’s group titled The Loss Surfaces of Multilayer Networks.

So if the question is “how can we tell if we are actually making progress or should we stop and try a different set of hyperparameters?” then something like your idea of using a moving average, such as an exponentially weighted average (EWA) of the most recent cost values, might be useful. The other key point to make here is that the cost by itself is not a very meaningful metric. We only use the cost curves for exactly that sort of qualitative judgement: do I have a convergence problem or not, and is it a waste of time to do more iterations? The real metric that matters is prediction accuracy on both the training and test sets. You can sample those every 50 or 100 iterations to get another view of whether the training is working well or not.
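
To make the moving-average idea concrete, here is a minimal sketch of an exponentially weighted average over a list of per-iteration cost values. The function name `ewa_costs`, the default `beta=0.9`, and the bias correction step are my own choices for illustration, not something fixed by the question, so treat them as assumptions you would tune.

```python
def ewa_costs(costs, beta=0.9):
    """Return the bias-corrected exponentially weighted average of costs.

    costs: per-iteration (or per-minibatch) cost values.
    beta:  smoothing factor (assumed value); roughly averages over the
           last 1 / (1 - beta) values.
    """
    smoothed = []
    v = 0.0
    for t, cost in enumerate(costs, start=1):
        # Standard EWA update: mostly the old average, a little new cost.
        v = beta * v + (1.0 - beta) * cost
        # Bias correction so early values are not pulled toward zero.
        smoothed.append(v / (1.0 - beta ** t))
    return smoothed


# Hypothetical noisy cost values from a few minibatch iterations.
noisy_costs = [0.90, 0.70, 0.95, 0.65, 0.80, 0.55, 0.75, 0.50]
smoothed_costs = ewa_costs(noisy_costs)
```

Plotting `smoothed_costs` next to `noisy_costs` should make the underlying trend easier to see than the raw minibatch curve, although a larger `beta` also makes the smoothed curve react more slowly to a real change.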

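One other important thing to note is that a lower cost does not necessarily indicate better accuracy, because the prediction accuracy for any given sample is “quantized”: the predicted value is compared against a threshold (typically 0.5 for binary classification), so the answer for that sample is simply right or wrong. The prediction value for a given true sample may go from 0.65 to 0.68 with the next 100 iterations. That gives a lower cost, but not a greater accuracy, because the sample was already classified correctly.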
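
Here is a small sketch of that 0.65 versus 0.68 case. It assumes binary classification with a cross-entropy cost and a 0.5 decision threshold, which the discussion above implies but does not state explicitly; the helper names `cross_entropy` and `predict_label` are hypothetical.

```python
import math


def cross_entropy(y_true, y_prob):
    """Binary cross-entropy cost for a single sample."""
    return -(y_true * math.log(y_prob) + (1 - y_true) * math.log(1 - y_prob))


def predict_label(y_prob, threshold=0.5):
    """Quantize a predicted probability into a 0/1 label (assumed threshold)."""
    return 1 if y_prob > threshold else 0


y_true = 1                   # a "true" sample
before, after = 0.65, 0.68   # prediction before and after 100 more iterations

cost_before = cross_entropy(y_true, before)  # roughly 0.43
cost_after = cross_entropy(y_true, after)    # roughly 0.39

# The cost went down, but the predicted label (and so the accuracy) did not change.
label_before = predict_label(before)
label_after = predict_label(after)
print(cost_before > cost_after, label_before == label_after)
```

The same effect works in the other direction too: a sample that moves from 0.49 to 0.51 barely changes the cost but flips the predicted label.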