I am training an image captioning model and getting NaN in my loss

I am training an image captioning model using TensorFlow on the Flickr8k dataset. I used ResNet50 to extract encodings of all my images, shaped (m, 49, 2048), and stored them for training. I used GloVe 6B 300d vectors for my vocabulary and the embedding layer matrix. I transformed my captions with a StringLookup layer into shapes (m, 37) for the training set and (m, 32) for the dev set, and saved them as well for direct use in training.
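For context, the image features and captions were prepared roughly along these lines (a simplified sketch, not my exact preprocessing script; the paths, the start/end tokens and the vocab list are illustrative placeholders):

"import numpy as np
import tensorflow as tf

# Image features: ResNet50 without its classification head; a 224x224 input
# yields a (7, 7, 2048) feature map, flattened to (49, 2048) per image.
cnn = tf.keras.applications.ResNet50(include_top=False, weights='imagenet')

def encode_image(path):
    img = tf.keras.utils.load_img(path, target_size=(224, 224))
    x = tf.keras.applications.resnet50.preprocess_input(np.array(img))
    feats = cnn(x[None, ...])                    # (1, 7, 7, 2048)
    return tf.reshape(feats, (49, 2048)).numpy()

# Captions: StringLookup over the vocabulary (built to match the GloVe 6B tokens),
# with id 0 reserved for padding/masking.
lookup = tf.keras.layers.StringLookup(vocabulary=vocab, mask_token='')

def encode_caption(caption, max_len):
    tokens = tf.strings.split('startseq ' + caption + ' endseq')
    ids = lookup(tokens).numpy()
    padded = np.zeros(max_len, dtype='int64')    # pad with 0, matching ignore_class=0
    padded[:len(ids)] = ids
    return padded
"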
This is my model code:
"def model_build():
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
image = tf.keras.Input((49, 2048))
input_caption = tf.keras.Input((None,))

    x_image = Dense(1024, activation='relu')(image)
    x_image = Dense(512, activation='relu')(x_image)

    embedding_layer = Embedding(400004, 300, trainable=False, mask_zero=False)
    embedding_layer.build((None,))
    embedding_layer.set_weights([emb_matrix])

    x_caption = embedding_layer(input_caption)
    x_caption = LSTM(512, return_sequences=True)(x_caption)

    attention = MultiHeadAttention(num_heads=1, key_dim=64)(query=x_caption, value=x_image)

    x = tf.keras.layers.Add()([x_caption, attention])
    x = LayerNormalization(epsilon=1e-6)(x)
    x = tf.keras.layers.Dropout(0.3)(x)

    x = LSTM(256, return_sequences=True)(x)
    x = tf.keras.layers.Dropout(0.3)(x)

    logits = Dense(400004, activation='linear',name="logits_layer")(x)
    logits = tf.keras.layers.Lambda(lambda t: tf.clip_by_value(t, -10.0, 10.0))(logits)

    model = tf.keras.Model(inputs=[image, input_caption], outputs=logits)
    model.compile(optimizer=Adam(learning_rate=1e-4, clipnorm=1.0),
                  loss=SparseCategoricalCrossentropy(from_logits=False, ignore_class=0),
                  metrics=[masked_accuracy])
    return model

"
Now, when I train the model for a few epochs on 1 image, it reaches 100% accuracy and overfits as expected, and on 5 images it reaches about 93% accuracy. But when I train on the complete dataset (around 6000 images in my train split), I get NaN loss in the middle of an ongoing epoch, after roughly 1000 images have been processed. This happens no matter where I start in the dataset: the loss goes NaN after about 1000 images. My data is fine; I checked it. So I used these two callbacks:
"class DebugLogitsCallback(tf.keras.callbacks.Callback):
def init(self, input_data):
self.input_data = input_data # A sample batch of (images, captions)

def on_train_batch_end(self, batch, logs=None):
    submodel = tf.keras.Model(inputs=self.model.inputs,
                              outputs=self.model.get_layer("logits_layer").output)
    sample_logits = submodel(self.input_data, training=False)
    max_logit = tf.reduce_max(sample_logits).numpy()
    min_logit = tf.reduce_min(sample_logits).numpy()
    print(f"Batch {batch}: Logits max = {max_logit:.4f}, min = {min_logit:.4f}")

class NaNLossCallback(tf.keras.callbacks.Callback):
def on_train_batch_end(self, batch, logs=None):
if logs[β€œloss”] is not None and tf.math.is_nan(logs[β€œloss”]):
print(f"NaN loss at batch {batch}")
self.model.stop_training = True

sample_batch = [train_images[:1], train_input_captions[:1]]
debug_callback = DebugLogitsCallback(sample_batch)

"
and this is the result I got:
"history=model.fit(
x=[train_images,train_input_captions],y=train_label_captions,
epochs=50,
batch_size=8,
validation_data=([dev_images,dev_input_captions],dev_label_captions),
callbacks=[NaNLossCallback(),debug_callback]
)
Epoch 1/50
I0000 00:00:1749020366.186489 1026 cuda_dnn.cc:529] Loaded cuDNN version 90300
I0000 00:00:1749020366.445219 1028 cuda_dnn.cc:529] Loaded cuDNN version 90300
Batch 0: Logits max = 0.0634, min = -0.0696
1/708 ━━━━━━━━━━━━━━━━━━━━ 2:16:45 12s/step - loss: 12.8995 - masked_accuracy: 0.0000e+00Batch 1: Logits max = 0.0622, min = -0.0707
2/708 ━━━━━━━━━━━━━━━━━━━━ 4:30 383ms/step - loss: 12.8984 - masked_accuracy: 0.0000e+00 Batch 2: Logits max = 0.0796, min = -0.0721
3/708 ━━━━━━━━━━━━━━━━━━━━ 4:27 380ms/step - loss: 12.8975 - masked_accuracy: 7.8064e-04Batch 3: Logits max = 0.0972, min = -0.0727
4/708 ━━━━━━━━━━━━━━━━━━━━ 4:25 378ms/step - loss: 12.8969 - masked_accuracy: 0.0021 Batch 4: Logits max = 0.1136, min = -0.0749
5/708 ━━━━━━━━━━━━━━━━━━━━ 4:24 376ms/step - loss: 12.8964 - masked_accuracy: 0.0035Batch 5: Logits max = 0.1281, min = -0.0797
6/708 ━━━━━━━━━━━━━━━━━━━━ 4:23 376ms/step - loss: 12.8960 - masked_accuracy: 0.0045Batch 6: Logits max = 0.1438, min = -0.0845
7/708 ━━━━━━━━━━━━━━━━━━━━ 4:23 376ms/step - loss: 12.8957 - masked_accuracy: 0.0054Batch 7: Logits max = 0.1606, min = -0.0905
8/708 ━━━━━━━━━━━━━━━━━━━━ 4:23 377ms/step - loss: 12.8954 - masked_accuracy: 0.0062Batch 8: Logits max = 0.1781, min = -0.0980
9/708 ━━━━━━━━━━━━━━━━━━━━ 4:23 377ms/step - loss: 12.8952 - masked_accuracy: 0.0068Batch 9: Logits max = 0.1957, min = -0.1072
10/708 ━━━━━━━━━━━━━━━━━━━━ 4:22 376ms/step - loss: 12.8950 - masked_accuracy: 0.0073Batch 10: Logits max = 0.2144, min = -0.1171
...
120/708 ━━━━━━━━━━━━━━━━━━━━ 3:41 376ms/step - loss: 12.8935 - masked_accuracy: 0.0118Batch 120: Logits max = 3.4171, min = -2.2954
121/708 ━━━━━━━━━━━━━━━━━━━━ 3:40 376ms/step - loss: 12.8935 - masked_accuracy: 0.0118Batch 121: Logits max = 3.4450, min = -2.3163
122/708 ━━━━━━━━━━━━━━━━━━━━ 3:40 376ms/step - loss: inf - masked_accuracy: 0.0118 Batch 122: Logits max = 3.4731, min = -2.3371
123/708 ━━━━━━━━━━━━━━━━━━━━ 3:40 376ms/step - loss: inf - masked_accuracy: 0.0118Batch 123: Logits max = 3.5013, min = -2.3580
124/708 ━━━━━━━━━━━━━━━━━━━━ 3:39 376ms/step - loss: inf - masked_accuracy: 0.0118NaN loss at batch 124
Batch 124: Logits max = 3.5296, min = -2.3789
708/708 ━━━━━━━━━━━━━━━━━━━━ 78s 94ms/step - loss: nan - masked_accuracy: 0.0121 - val_loss: nan - val_masked_accuracy: nan
"

Can someone explain the reason and cause of the NaN loss, and how to fix it?

Perhaps an image or some images in the training set are not in the right format…

Thanks for the reply, but that's not the issue: all images are already transformed into feature maps using ResNet50. I have already checked the training data and it is fine, with no issues.


Finally fixed the problem: it was a GPU fault on Kaggle. Using 2 GPUs caused this error; on 1 GPU it works fine.
The inf/NaN loss appears to be an artifact of the MirroredStrategy (multi-GPU) execution: most likely one of the GPUs computes an inf (either in the loss or in a gradient) for its shard of data around batch 120, which then taints the CollectiveReduceV2 result.
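If someone wants to pin down where the first bad value appears, TensorFlow's numerics checking can be switched on before building the model, so training errors out at the first op that produces an inf/NaN instead of letting it propagate through the all-reduce. A quick diagnostic sketch (not from my actual training script):

"import tensorflow as tf

# Raise an error at the first op whose output contains inf/NaN,
# in both eager and graph execution.
tf.debugging.enable_check_numerics()

# Keras also has a built-in callback that stops training once the logged loss
# becomes NaN, similar to the custom NaNLossCallback above.
nan_stop = tf.keras.callbacks.TerminateOnNaN()
"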

Alright good, how did you fix it?

Not so much a fix, but I avoided the issue by not using 2 GPUs simultaneously to train my model. I think it is happening on Kaggle only.
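In case anyone hits the same thing, restricting TensorFlow to a single GPU before any model or strategy is created avoids the multi-GPU path entirely. A minimal sketch of what I mean (device index 0 picked arbitrarily):

"import tensorflow as tf

# Hide the second GPU before building the model, so MirroredStrategy
# (or the default strategy) only ever sees one device.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[0], 'GPU')

print(tf.config.list_logical_devices('GPU'))  # should now list a single GPU
"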
