C5W2A2: Emojify_V2: "loss: nan" and thus the model does not train

Greetings,

My code for Exercise 5 - Emojify_V2 passes all unit tests, and the automatic grader even gave it 100%. However, while training the Emojify_V2 model, I sometimes get “loss: nan” and the model does not actually train. I checked the training dataset and it has no null values (only valid sentence indices). I have also restarted the kernel multiple times. Any explanation for why we get loss: nan? Thanks!

Epoch 1/50
5/5 [==============================] - 0s 10ms/step - loss: nan - accuracy: 0.1667
Epoch 2/50
5/5 [==============================] - 0s 22ms/step - loss: nan - accuracy: 0.1667
Epoch 3/50
5/5 [==============================] - 0s 22ms/step - loss: nan - accuracy: 0.1667
Epoch 4/50
5/5 [==============================] - 0s 22ms/step - loss: nan - accuracy: 0.1667

Did you change any of the training parameters, like the batch size?

Yes, I changed the number of epochs and the batch size. Same error.

Hey @Mo_Mosaad_H,
Welcome to the community. If I am not wrong, the notebook uses 50 epochs to train the model and it seems like you have used the same as well (as per your code shown). Now, you might have changed the batch size, which could result in extremely large or extremely small values of loss, which resulted in loss = nan, and based on your output, it seems like that the loss is too large, since it is nan from the first epoch itself. Also, since the first iteration through the network gives loss = nan, hence, the back-propagation also fails, since it is not possible to differentiate something that is nan, and hence, the loss doesn’t decrease.

If the epochs are still 50, then try changing the batch size and see if the issue resolves. The trick is to set the initial values of these parameters so that the loss stays within floating-point limits, i.e., it neither under-flows nor over-flows. Once they are set to good enough values, you have a finite loss that can be back-propagated through, and after some CO_2 emissions, your loss will eventually come down and your accuracy will go up.
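As an aside, the overflow-to-nan behaviour is easy to reproduce with a naive softmax in plain NumPy. This is just an illustrative sketch (the function names are made up, not from the assignment): once the logits grow large enough, `exp` overflows to `inf` and the normalization becomes `inf / inf = nan`.

```python
import numpy as np

def naive_softmax(z):
    # np.exp overflows to inf for large logits, so the
    # normalization becomes inf / inf = nan
    e = np.exp(z)
    return e / e.sum()

def stable_softmax(z):
    # subtracting the max keeps every exponent <= 0, so no overflow
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])  # extreme values, e.g. from a diverging model
print(naive_softmax(logits))   # -> [nan nan nan]
print(stable_softmax(logits))  # -> a valid probability distribution summing to 1
```

Keras applies the max-subtraction trick internally, but a diverging model can still push intermediate values past floating-point limits, which is where the nan comes from.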

Let us know if this helps.

Cheers,
Elemento

Thanks for your response.
I restarted the notebook. The problem still happens, but randomly, and I cannot really understand why. I ran the training with the provided parameters and it finished, but the accuracy was about 80%, below the expected 90-100%. I ran it again and got “loss: nan” at later epochs, as shown below. Then I ran it a third time, and the loss: nan appeared from the first epoch.

Epoch 35/50
5/5 [==============================] - 0s 24ms/step - loss: 0.3408 - accuracy: 0.9697
Epoch 36/50
5/5 [==============================] - 0s 36ms/step - loss: 0.3207 - accuracy: 0.9697
Epoch 37/50
5/5 [==============================] - 0s 22ms/step - loss: 0.3344 - accuracy: 0.9545
Epoch 38/50
5/5 [==============================] - 0s 25ms/step - loss: 0.3091 - accuracy: 0.9773
Epoch 39/50
5/5 [==============================] - 0s 24ms/step - loss: 0.3174 - accuracy: 0.9621
Epoch 40/50
5/5 [==============================] - 0s 35ms/step - loss: 0.4215 - accuracy: 0.9697
Epoch 41/50
5/5 [==============================] - 0s 34ms/step - loss: nan - accuracy: 0.3561
Epoch 42/50
5/5 [==============================] - 0s 23ms/step - loss: nan - accuracy: 0.1667
Epoch 43/50
5/5 [==============================] - 0s 24ms/step - loss: nan - accuracy: 0.1667

Hey @Mo_Mosaad_H,
That’s indeed strange. The output you have shown now indicates that your loss became nan only at the 41st epoch and training was fine up to that point. Can you please DM me your notebook as an attachment? I can only say more after taking a look at the code and trying it out myself.

Cheers,
Elemento

For what it’s worth, I had the same issue as you did. I had used the ReLU activation function, which I switched to sigmoid, and the model converged very strongly:
Epoch 33/50
5/5 [==============================] - 2s 362ms/step - loss: 0.0658 - accuracy: 0.9924
Epoch 34/50
5/5 [==============================] - 2s 347ms/step - loss: 0.0462 - accuracy: 1.0000
Epoch 35/50
5/5 [==============================] - 2s 360ms/step - loss: 0.0462 - accuracy: 0.9924
Epoch 36/50
5/5 [==============================] - 2s 376ms/step - loss: 0.0400 - accuracy: 1.0000
Epoch 37/50
5/5 [==============================] - 2s 359ms/step - loss: 0.0342 - accuracy: 1.0000
Epoch 38/50
5/5 [==============================] - 2s 362ms/step - loss: 0.0289 - accuracy: 1.0000
Epoch 39/50
5/5 [==============================] - 2s 359ms/step - loss: 0.0276 - accuracy: 1.0000
Epoch 40/50
5/5 [==============================] - 2s 361ms/step - loss: 0.0258 - accuracy: 1.0000
Epoch 41/50
5/5 [==============================] - 2s 361ms/step - loss: 0.0217 - accuracy: 1.0000
Epoch 42/50
5/5 [==============================] - 2s 374ms/step - loss: 0.0199 - accuracy: 1.0000
Epoch 43/50
5/5 [==============================] - 2s 362ms/step - loss: 0.0198 - accuracy: 1.0000
Epoch 44/50
5/5 [==============================] - 2s 361ms/step - loss: 0.0153 - accuracy: 1.0000
Epoch 45/50
5/5 [==============================] - 2s 363ms/step - loss: 0.0180 - accuracy: 1.0000
Epoch 46/50
5/5 [==============================] - 2s 359ms/step - loss: 0.0158 - accuracy: 1.0000
Epoch 47/50
5/5 [==============================] - 2s 359ms/step - loss: 0.0156 - accuracy: 1.0000
Epoch 48/50
5/5 [==============================] - 2s 360ms/step - loss: 0.0145 - accuracy: 1.0000
Epoch 49/50
5/5 [==============================] - 2s 374ms/step - loss: 0.0153 - accuracy: 1.0000
Epoch 50/50
5/5 [==============================] - 2s 375ms/step - loss: 0.0136 - accuracy: 1.0000

I did not touch the batch size or other elements.

Hey @Ludovic_Legrand,
Welcome to the community. In this assignment, you are supposed to use the softmax activation function, as stated in the markdown of “Exercise 2 - model”. Mohamed had used the ReLU activation function in place of softmax, which caused this error.
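To illustrate why a non-softmax output layer breaks categorical cross-entropy, here is a minimal NumPy sketch (a toy example with made-up numbers, not the assignment code): ReLU can emit exact zeros, and the loss involves log of the predictions, so log(0) appears and the result is nan.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred):
    # loss = -sum(y_true * log(y_pred)); undefined when y_pred contains 0
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])     # one-hot label

relu_out = np.array([0.0, 2.7, 1.3])   # ReLU can emit exact zeros; not a distribution
softmax_out = np.exp([1.0, 0.5, -0.2])
softmax_out /= softmax_out.sum()       # valid probability distribution

print(categorical_cross_entropy(y_true, relu_out))     # 0 * log(0) = 0 * (-inf) -> nan
print(categorical_cross_entropy(y_true, softmax_out))  # finite, positive loss
```

Softmax guarantees strictly positive outputs that sum to 1, so the log is always defined and the loss stays finite.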

Cheers,
Elemento