C4_W3: UNet training for > 20 epochs hits loss: nan

I was just experimenting with the image segmentation UNet notebook and found that training the model for more than 20 epochs hits a loss: nan error. I tried to fix it by adding BatchNormalization after every Conv2D in the model, but to no avail. Any insight on this?

Hi @khteh

Getting loss: nan after around 20 epochs usually points to training instability rather than an architecture issue. Adding BatchNormalization is a good start, but NaNs typically come from an excessively high learning rate, exploding gradients, or invalid operations (e.g. division by zero).

Try lowering your learning rate by 10x, add gradient clipping, and verify your input normalization. Also, ensure your loss function (e.g., Dice, BCE, or combined) isn’t producing division by zero when masks are empty. If the loss remains unstable, you can monitor gradients during training to pinpoint where NaNs first appear — that often reveals whether the issue is numerical or data-related.
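A minimal sketch of those pieces, assuming the notebook's model is called `unet`; the learning rate, clipping norm, and smoothing constant are illustrative, and the Dice form only matters if you use a Dice-style loss rather than the assignment's own one:

```python
import tensorflow as tf

# A smaller learning rate plus gradient clipping; clipnorm caps the norm of
# each gradient. The values 1e-4 and 1.0 are illustrative, not tuned.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)

# A Dice loss with a smoothing term so an empty mask never divides by zero.
# This form assumes binary masks and sigmoid probabilities of the same shape;
# adapt the axes/classes if your model outputs multi-class logits.
def dice_loss(y_true, y_pred, smooth=1e-6):
    y_true = tf.cast(y_true, y_pred.dtype)
    intersection = tf.reduce_sum(y_true * y_pred, axis=[1, 2, 3])
    total = tf.reduce_sum(y_true + y_pred, axis=[1, 2, 3])
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)

# Optional: raise an error at the first op that produces NaN/Inf, which points
# at the offending layer/tensor (slows training considerably; debugging only).
tf.debugging.enable_check_numerics()

unet.compile(optimizer=optimizer,
             loss=dice_loss,        # or keep the assignment's original loss
             metrics=['accuracy'])
```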

Hope it helps! Feel free to ask if you need further assistance.


(1) Try lowering your learning rate by 10x
I know about this but haven't tried it yet.
(2) add gradient clipping
(3) and verify your input normalization.
I tried https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization and the nan happens as early as epoch 10.
(4) Also, ensure your loss function (e.g., Dice, BCE, or combined) isn’t producing division by zero when masks are empty.
(5) If the loss remains unstable, you can monitor gradients during training to pinpoint where NaNs first appear

Can you advise on (2), (4) and (5), i.e. how to actually do those?

One other thing to note is that just because the loss becomes NaN does not necessarily mean that the training is failing. With the cross entropy loss you can get NaN if the softmax output saturates, which is the equivalent of getting exactly 1 as the output of sigmoid; that can happen because of rounding in floating point. In other words, the answer becomes too good. The training may still be working in that case, because the weight updates don't actually depend on the value of J itself, only on the derivatives, and those are still valid.

So J is a pretty crude metric. You should also compute the accuracy every epoch and see if that is improving or not.

There are ways to prevent the saturation behavior as described on this thread.
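As a reference point, one common way to avoid the saturation (not necessarily the exact method from that thread) is to leave the final layer linear and let the loss apply softmax internally in a numerically stable way, while tracking accuracy as a metric. Here `unet` stands in for the notebook's model:

```python
import tensorflow as tf

# The final Conv2D has no softmax activation; from_logits=True fuses softmax
# and log inside the loss, so the log never sees an exact 0 or 1.
unet.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],  # watch this per epoch even if the reported loss goes NaN
)
```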


One other note: the way the notebook is given to us, it runs 40 epochs of training. I just went back and ran my notebook again and I do not get NaN in 40 epochs. So maybe there is some other problem here to be investigated …


Hello @khteh,

Like Paul, I have also run my full-mark notebook for 40 epochs on Coursera, and it ended up without NaN. I have two suggestions:

  1. If you keep a copy of your current notebook and start again with a fresh one, making changes only in the sections marked by ### START CODE HERE and ### END CODE HERE, then we will all be pretty much on the same page (except for the answers, of course) and it will be easier to discuss. If the new notebook also finishes without NaN, you may find the cause by comparing it against your current notebook.

  2. If you want to debug it like a real problem, you may add two callbacks. The first one should save a model checkpoint at the end of each epoch, so that you can observe the trends of some aggregated values (max/min, counts of NaN and Inf) over the checkpoints. The second callback can halt the training process once the loss becomes NaN and save everything (the minibatch, the model, the optimizer, and so on) at that training step. You can then reproduce the NaN, examine each of the saved elements and the output of each model layer, and figure out where things went wrong. A rough sketch of both callbacks follows below.

For the sake of passing the assignment, suggestion 1 is recommended. For how to use callbacks, a quick search (or Google's AI mode) should give you many examples and explanations.
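Assuming the notebook's `unet`, `train_dataset`, and `EPOCHS` names, and saving only the model weights (not the minibatch or optimizer state), the sketch could look like this:

```python
import numpy as np
import tensorflow as tf

# Callback 1: save a checkpoint at the end of every epoch so weight trends
# (max/min, NaN/Inf counts) can be inspected across checkpoints afterwards.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='unet_epoch_{epoch:02d}.weights.h5',
    save_weights_only=True,
)

# Callback 2: halt training at the first batch whose loss is NaN/Inf and save
# the model state at that step for offline inspection.
class HaltOnNaN(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None and not np.isfinite(loss):
            self.model.save_weights('unet_at_nan.weights.h5')
            self.model.stop_training = True

model_history = unet.fit(train_dataset, epochs=EPOCHS,
                         callbacks=[checkpoint_cb, HaltOnNaN()])
```

If you only need to stop (without saving anything at that step), the built-in tf.keras.callbacks.TerminateOnNaN() is a simpler alternative for the second callback.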

Good luck, and cheers,
Raymond

I completed the course two weeks ago, my subscription has expired, and I no longer have access to the Colab environment. Running the notebook locally on my laptop for 30 epochs shows that the loss: nan happens at an earlier epoch than shown in the collapsed cell output:

I use 100 of the original 1060 examples as an evaluation dataset and use EarlyStopping to bail out after the accuracy metric plateaus for 3 successive epochs, but to no avail. Although the training does continue without apparent harm, the predictions fail: the generated segmentation masks are empty.
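For reference, the split and early stopping described above look roughly like this; `processed_image_ds`, `BATCH_SIZE`, `BUFFER_SIZE`, and `unet` are placeholders for whatever my local notebook defines, and accuracy must be compiled as a metric for `val_accuracy` to exist:

```python
import tensorflow as tf

# Hold out 100 of the 1060 (image, mask) pairs for validation, train on the rest.
val_dataset = processed_image_ds.take(100).batch(BATCH_SIZE)
train_dataset = processed_image_ds.skip(100).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

# Stop once validation accuracy has not improved for 3 consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3)

unet.fit(train_dataset, validation_data=val_dataset,
         epochs=30, callbacks=[early_stop])
```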


What else can / should I do?

Have you duplicated the versions of all the packages that are being used when you run the notebook on Coursera? Probably not, since you no longer have access to the Coursera version. The notebook is written to use the versions of all packages that were current in April of 2021. If you run things with current versions, there is no guarantee that it will work the same way.
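If you do want to rule that out, one low-effort check is to compare package versions; the 2.4.x pin below is only an assumption about which TensorFlow release line was current in early 2021:

```python
# Compare your local versions against the era of the notebook; pinning, e.g.
#   pip install "tensorflow==2.4.*"
# reproduces that environment more closely than the latest release does.
import tensorflow as tf
print(tf.__version__)
```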

It is not due to the environment / package versions of the system: Adam(self._learning_rate, clipnorm=1.0) works.

Well, just to be completely scientific here: you changed the parameters, so perhaps that change was necessary precisely because you are using different versions of the various packages. In other words, I don't think you can scientifically claim to have proved that this result is not due to versionitis.

But if you have gotten it to work, then perhaps this point is not worth any further expense in terms of mental effort.
