C4_W3: UNet training for > 20 epochs hits loss: nan

I was just experimenting with the image segmentation UNet notebook and found that training the model for more than 20 epochs hits a loss: nan error. I tried to fix it by adding BatchNormalization after every Conv2D in the model, but to no avail. Any insight on this?

Hi @khteh

Getting loss: nan after around 20 epochs usually points to training instability rather than an architecture issue. Adding BatchNormalization is a good start, but NaNs typically come from an excessively high learning rate, exploding gradients, or invalid operations (e.g. division by zero).

Try lowering your learning rate by 10x, add gradient clipping, and verify your input normalization. Also, ensure your loss function (e.g., Dice, BCE, or combined) isn’t producing division by zero when masks are empty. If the loss remains unstable, you can monitor gradients during training to pinpoint where NaNs first appear — that often reveals whether the issue is numerical or data-related.
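A minimal sketch of those pieces, assuming the notebook's model is called `unet`; the learning rate, clipping norm, and smoothing constant are illustrative, and the Dice form only matters if you use a Dice-style loss rather than the assignment's own one:

```python
import tensorflow as tf

# A smaller learning rate plus gradient clipping; clipnorm caps the norm of
# each gradient. The values 1e-4 and 1.0 are illustrative, not tuned.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, clipnorm=1.0)

# A Dice loss with a smoothing term so an empty mask never divides by zero.
# This form assumes binary masks and sigmoid probabilities of the same shape;
# adapt the axes/classes if your model outputs multi-class logits.
def dice_loss(y_true, y_pred, smooth=1e-6):
    y_true = tf.cast(y_true, y_pred.dtype)
    intersection = tf.reduce_sum(y_true * y_pred, axis=[1, 2, 3])
    total = tf.reduce_sum(y_true + y_pred, axis=[1, 2, 3])
    return 1.0 - (2.0 * intersection + smooth) / (total + smooth)

# Optional: raise an error at the first op that produces NaN/Inf, which points
# at the offending layer/tensor (slows training considerably; debugging only).
tf.debugging.enable_check_numerics()

unet.compile(optimizer=optimizer,
             loss=dice_loss,        # or keep the assignment's original loss
             metrics=['accuracy'])
```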

Hope it helps! Feel free to ask if you need further assistance.


(1) Try lowering your learning rate by 10x
I know about this but haven't tried it yet.
(2) add gradient clipping
(3) and verify your input normalization.
I tried https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization and the nan happens as early as epoch 10.
(4) Also, ensure your loss function (e.g., Dice, BCE, or combined) isn’t producing division by zero when masks are empty.
(5) If the loss remains unstable, you can monitor gradients during training to pinpoint where NaNs first appear

Can you advise on (2), (4) and (5), i.e. how to actually do those?

One other thing to note is that just because the loss becomes NaN does not necessarily mean that the training is failing. With the cross entropy loss you can get NaN if the softmax output saturates, which is the equivalent of getting exactly 1 as the output of sigmoid; that can happen because of rounding in floating point. In other words, the answer becomes too good. The training may still be working in that case, because the weight updates don't actually depend on the value of J itself, only on the derivatives, and those are still valid.

So J is a pretty crude metric. You should also compute the accuracy every epoch and see if that is improving or not.

There are ways to prevent the saturation behavior as described on this thread.
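As a reference point, one common way to avoid the saturation (not necessarily the exact method from that thread) is to leave the final layer linear and let the loss apply softmax internally in a numerically stable way, while tracking accuracy as a metric. Here `unet` stands in for the notebook's model:

```python
import tensorflow as tf

# The final Conv2D has no softmax activation; from_logits=True fuses softmax
# and log inside the loss, so the log never sees an exact 0 or 1.
unet.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],  # watch this per epoch even if the reported loss goes NaN
)
```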


One other note: the way the notebook is given to us, it runs 40 epochs of training. I just went back and ran my notebook again and I do not get NaN in 40 epochs. So maybe there is some other problem here to be investigated …


Hello @khteh,

Like Paul, I have also run my full-mark notebook for 40 epochs on Coursera, and it ended up without NaN. I have two suggestions:

  1. If you keep a copy of your current notebook and start again with a fresh one, making changes only in the sections marked by ### START CODE HERE and ### END CODE HERE, then we will all be pretty much on the same page (except for the answers, of course) and it will be easier to discuss. If the new notebook also finishes without NaN, you may find the cause by comparing it against your current notebook.

  2. If you want to debug it like a real problem, you may add two callbacks. The first one should save a model checkpoint at the end of each epoch, so that you can observe the trends of some aggregated values (max/min, counts of NaN and Inf) over the checkpoints. The second callback can halt the training process once the loss becomes NaN and save everything (the minibatch, the model, the optimizer, and so on) at that training step. You can then reproduce the NaN, examine each of the saved elements and the output of each model layer, and figure out where things went wrong. A rough sketch of both callbacks follows below.

For the sake of passing the assignment, suggestion 1 is recommended. For how to use callbacks, a quick search (or Google's AI mode) should give you many examples and explanations.
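Assuming the notebook's `unet`, `train_dataset`, and `EPOCHS` names, and saving only the model weights (not the minibatch or optimizer state), the sketch could look like this:

```python
import numpy as np
import tensorflow as tf

# Callback 1: save a checkpoint at the end of every epoch so weight trends
# (max/min, NaN/Inf counts) can be inspected across checkpoints afterwards.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='unet_epoch_{epoch:02d}.weights.h5',
    save_weights_only=True,
)

# Callback 2: halt training at the first batch whose loss is NaN/Inf and save
# the model state at that step for offline inspection.
class HaltOnNaN(tf.keras.callbacks.Callback):
    def on_train_batch_end(self, batch, logs=None):
        loss = (logs or {}).get('loss')
        if loss is not None and not np.isfinite(loss):
            self.model.save_weights('unet_at_nan.weights.h5')
            self.model.stop_training = True

model_history = unet.fit(train_dataset, epochs=EPOCHS,
                         callbacks=[checkpoint_cb, HaltOnNaN()])
```

If you only need to stop (without saving anything at that step), the built-in tf.keras.callbacks.TerminateOnNaN() is a simpler alternative for the second callback.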

Good luck, and cheers,
Raymond

I completed the course two weeks ago, my subscription has expired, and I no longer have access to the Colab environment. Running the notebook locally on my laptop for 30 epochs shows that the loss: nan happens at an earlier epoch than shown in the collapsed cell output:

I use 100 of the original 1060 examples as an evaluation dataset and use EarlyStopping to bail out after the accuracy metric plateaus for 3 successive epochs, but to no avail. Although the training does continue without apparent harm, the predictions fail: the generated segmentation masks are empty.
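For reference, the split and early stopping described above look roughly like this; `processed_image_ds`, `BATCH_SIZE`, `BUFFER_SIZE`, and `unet` are placeholders for whatever my local notebook defines, and accuracy must be compiled as a metric for `val_accuracy` to exist:

```python
import tensorflow as tf

# Hold out 100 of the 1060 (image, mask) pairs for validation, train on the rest.
val_dataset = processed_image_ds.take(100).batch(BATCH_SIZE)
train_dataset = processed_image_ds.skip(100).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

# Stop once validation accuracy has not improved for 3 consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3)

unet.fit(train_dataset, validation_data=val_dataset,
         epochs=30, callbacks=[early_stop])
```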


What else can / should I do?

Have you duplicated the versions of all the packages that are being used when you run the notebook on Coursera? Probably not, since you no longer have access to the Coursera version. The notebook is written to use the versions of all packages that were current in April of 2021. If you run things with current versions, there is no guarantee that it will work the same way.
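If you do want to rule that out, one low-effort check is to compare package versions; the 2.4.x pin below is only an assumption about which TensorFlow release line was current in early 2021:

```python
# Compare your local versions against the era of the notebook; pinning, e.g.
#   pip install "tensorflow==2.4.*"
# reproduces that environment more closely than the latest release does.
import tensorflow as tf
print(tf.__version__)
```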

It is not due to the environment / package versions of the system: Adam(self._learning_rate, clipnorm=1.0) works.

Well, just to be completely scientific here: you changed the parameters, so perhaps that change was necessary precisely because you are using different versions of the various packages. In other words, I don't think you can scientifically claim to have proved that this result is not due to versionitis.

But if you have gotten it to work, then perhaps this point is not worth any further expense in terms of mental effort.
