A DLS computing efficiency question

Hi Mentors
Another ‘improving my understanding’ question.
At no point in the coding examples in DLS 1 or DLS 2 do I see a decision point that says
“OK we have reached a position where the gradient is close enough to zero for training purposes so let’s stop the iterations”
What we do is set a number of iterations and follow that slavishly, regardless of the level of accuracy it achieves.
Why is this? A check like that would seem simple enough and would potentially avoid a lot of guesswork.


After you move on to TensorFlow, you’ll have an exercise in writing callbacks. You can have the training engine call into your custom code and implement controls such as the one you envision there.

Here’s a preview:

OK thanks AI curious
I guess at this stage we don’t need the added complication and so it is left until the TensorFlow unit.

It’s an interesting question or set of questions. Not to put too fine a point on it, but I think Prof Ng did spend quite a bit of time talking about the whole set of questions around “How do I recognize whether my model works as well as it should/could and what do I do about it when it doesn’t?” In fact, you could say that’s the main topic of Week 1 and Week 2 of Course 2 and essentially all of Course 3. “Is convergence working well” is just one minor aspect of that bigger picture.

Note that in all the cases in which Prof Ng shows us how to build something ourselves, he’s showing us the “bare bones” version. E.g. consider the case of a fixed learning rate and a fixed number of iterations. In the most recent version of these courses (April 2021) they did add lecture and assignment material about decaying learning rates. But once we graduate to frameworks, the convergence is managed for us using more sophisticated techniques involving adaptive learning rates and looking at the magnitudes of gradients and the like. Or you can also take it to another level by writing your own custom callback functions to make that type of decision, using the information that ai_curious points out.
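For readers who want to see what such a stopping rule might look like in a hand-written loop, here is a minimal sketch in plain Python. It is not code from the course assignments: the function name `gradient_descent`, the `tolerance` value, and the toy cost function are all assumptions chosen for illustration.

```python
import math

def gradient_descent(grad_fn, w, learning_rate=0.1,
                     max_iterations=10_000, tolerance=1e-6):
    """Run plain gradient descent, stopping early once the gradient norm
    drops below `tolerance` (an illustrative threshold, not a course value)."""
    for iteration in range(max_iterations):
        grad = grad_fn(w)
        grad_norm = math.sqrt(sum(g * g for g in grad))
        if grad_norm < tolerance:
            # Gradient is "close enough to zero": stop iterating.
            return w, iteration, True
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad)]
    # Iteration budget exhausted without meeting the tolerance.
    return w, max_iterations, False

# Toy convex cost J(w) = (w0 - 3)^2 + (w1 + 1)^2, minimised at (3, -1).
def toy_grad(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

w_final, steps, converged = gradient_descent(toy_grad, [0.0, 0.0])
print(w_final, steps, converged)
```

On a toy problem like this the loop exits long before the iteration budget runs out. On a real network the gradient norm is noisy and may never get that small, which is one reason a fixed budget or a validation metric is often used as the stopping signal instead.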

In all this, the cost J is really not very informative: it’s basically a cheap proxy for whether convergence is making progress or not. Hitting a given J value is not the goal: the accuracy is the point, but that’s more expensive to compute. You’ll notice if you look at all the back propagation formulas that you never see the bare J value anywhere. All you see are derivatives of the cost function with respect to various parameters.


Hi Paul
Apologies if my continual questions are a pain.
The subject is of real interest to me and I want to make sure I come out of this with real knowledge of the topic rather than a series of misunderstandings strung together by rote-learned process.
I was always taught that “the only stupid question is the one you didn’t ask”

With respect to the cost function / convergence point that you make, I appreciate what you say but at the moment (and with my limited knowledge) J seems to be the primary parameter ensuring the NN is in a state where it can be used to make decisions about the test / subject data.

I salute your pursuit of making sure you have a solid understanding of the concepts.

With that in mind, we need to dive a little deeper on the cost issue. Note that (as we discussed on another thread), the cost J is the average of the individual loss values across all the training samples. Given that, it is completely possible that a lower J value does not get you a better accuracy value. Suppose you have a given sample which has a label of 1 (“is a cat”), but the $\hat{y}$ value goes from 0.35 to 0.45 over the last 100 iterations. The cost J will go down, but the prediction is still wrong because the $\hat{y}$ value needs to be > 0.5 in order for the prediction to flip to “true”, right? So the model is “less wrong” in terms of J, but it doesn’t produce a more accurate prediction. Yet …

This is a more concrete example of the point that I was trying to make in my previous reply: the actual J value is a pretty low resolution metric for the performance of your model.
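The 0.35 to 0.45 case can be checked numerically. The sketch below assumes the standard binary cross-entropy loss and a 0.5 decision threshold; the helper names `cross_entropy_loss` and `predict` are made up for this illustration.

```python
import math

def cross_entropy_loss(y, y_hat):
    """Binary cross-entropy loss for a single sample."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def predict(y_hat, threshold=0.5):
    """Turn the sigmoid output into a hard 0/1 prediction."""
    return 1 if y_hat > threshold else 0

y = 1  # true label: "is a cat"
for y_hat in (0.35, 0.45):
    loss = cross_entropy_loss(y, y_hat)
    pred = predict(y_hat)
    print(f"y_hat={y_hat:.2f}  loss={loss:.4f}  prediction={pred}  correct={pred == y}")
```

The loss for this sample falls from roughly 1.05 to roughly 0.80, yet the prediction is 0 in both cases, so this sample contributes nothing to accuracy either before or after.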

Hi Paul
With my limited skill and experience I had formed a different view of what the cost function was measuring.
I did not see J as a measure of accuracy, but more as an indicator of model usability.
If I may use an imaginary example:
Let us suppose NASA is sending a lander to Europa in the not too distant future.
We know from flyby images that this moon of Jupiter is an ice-bound world, but we have no high-resolution pictures we can use to identify suitable landing spots.
We equip the space module with a NN capable of assessing images of the surface to search out flat, clear spots in the area we wish to explore.
We cannot use images of Europa to train the NN, so we decide to use the many images of Antarctica that are available, on the basis that Europa may be like the South Pole. It’s cold and horrible so perhaps Europa is too.
However many pictures of Antarctica we use, we cannot call them accurate; we simply have no idea of the realities of the ice moon. If we fit the NN too closely, we overfit it to Antarctica, and given different circumstances on Europa we may not be able to recognise suitable landing spots.
If we underfit then the NN may misidentify dangerous features as safe.
So we need some kind of mid range so the NN can be flexible given unknown conditions.
Given the deep learning algorithms I have seen so far, the obvious parameter to use to establish the NN in an optimum ‘usability zone’ is the cost function.
You may well amend the other hyperparameters to adjust performance, but in the end you measure the cost to establish model usability and ‘best fit’ of the target data (images of the planet’s surface).

Just wanted to emphasize that my reply about callbacks only addresses the part of the original question about slavish completion of a fixed number of epochs. It’s unlikely one would use that approach when computation costs real money and time. Callbacks are the TensorFlow control mechanism that allows you to evaluate your model’s state incrementally. What you decide to do within the callback, such as adapt the learning rate, take a snapshot, or even take an early exit, as well as what metrics to use to make those decisions, is a broader topic, some of which @paulinpaloalto brings up. My guess is that because callbacks are available from common DL frameworks (e.g. PyTorch and TensorFlow both offer them) and this level of training loop control is not considered part of the learning critical path, they are deferred from the early learning examples. You definitely see them in other course offerings, such as the TensorFlow Advanced Topics specialization.
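To show the shape of the idea without depending on any framework, here is a rough sketch of a callback that takes an early exit when a validation loss stops improving. `EarlyStopper`, `train`, and the loss values are hypothetical stand-ins written for this post, not TensorFlow or PyTorch APIs.

```python
class EarlyStopper:
    """Hypothetical callback-style helper (not a framework class).

    Signals a stop once the monitored value has failed to improve
    for `patience` consecutive epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.best_epoch = None
        self.wait = 0

    def on_epoch_end(self, epoch, val_loss):
        """Return True if training should stop after this epoch."""
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and reset the patience counter.
            self.best = val_loss
            self.best_epoch = epoch
            self.wait = 0
            return False
        self.wait += 1
        return self.wait >= self.patience


def train(val_losses, stopper):
    """Stand-in training loop: `val_losses` plays the role of the
    validation loss a real loop would compute after each epoch."""
    for epoch, val_loss in enumerate(val_losses):
        # ... a real loop would update the model parameters here ...
        if stopper.on_epoch_end(epoch, val_loss):
            return epoch
    return len(val_losses) - 1


# Made-up validation losses that improve and then drift upward.
made_up_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.60, 0.61, 0.62, 0.63]
stopper = EarlyStopper(patience=3)
last_epoch = train(made_up_losses, stopper)
print(last_epoch, stopper.best_epoch, stopper.best)
```

In Keras the same pattern is available ready-made as `tf.keras.callbacks.EarlyStopping`, which takes `monitor` and `patience` arguments, so in practice you would rarely write this by hand.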

Given a common understanding of what parameters are important in the development of the model it would be simple to control that parameter outside of any loops / matrix operations, thus not incurring any computing penalties.
The obvious question is… which parameters?

I’m not sure there is any single correct answer here, and I’d suggest it depends on the business case behind each cell of the confusion matrix: how bad is a false negative? Remember that model ‘accuracy’ means the prediction matches the label during training or evaluation, but it measures nothing about whether the label was correct in the first place or the extent to which the training data mimic the intended operational environment. If you have enough context to differentiate between images of safe landing spots versus unsafe ones, and want to train a model that generalizes well to images it hasn’t been trained on, you will likely measure performance on more than cost alone. I found this paper written by people in the remote sensing community that surveys the subject. You might find some ideas about metrics there or in the papers they cite:


Abstract: Convolutional neural network (CNN)-based deep learning (DL) is a powerful, recently developed image classification approach. With origins in the computer vision and image processing communities, the accuracy assessment methods developed for CNN-based DL use a wide range of metrics that may be unfamiliar to the remote sensing (RS) community. To explore the differences between traditional RS and DL RS methods, we surveyed a random selection of 100 papers from the RS DL literature.
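As a small illustration of why the confusion matrix cells matter, the sketch below computes accuracy, precision, and recall from a handful of made-up labels. It treats “hazardous site” as the positive class (an assumption for this example), so a false negative is a hazard the model called safe. The function names and data are invented for illustration only.

```python
def confusion_counts(labels, predictions):
    """Count TP, FP, FN, TN for binary labels (1 = positive class)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn

def safe_div(numerator, denominator):
    """Avoid dividing by zero when a class never occurs."""
    return numerator / denominator if denominator else 0.0

# Made-up data: 1 = hazardous site, 0 = safe site.
labels      = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predictions = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(labels, predictions)
accuracy = safe_div(tp + tn, len(labels))
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Here the accuracy looks respectable, but two of the three hazards were missed. Whether that trade-off is acceptable is exactly the business-case question, and it is not something the cost value alone can tell you.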
