Zombie Detector (C3W2) - Training Step Loss Going Crazy

I’ve been working on the Zombie Detector assignment, and every cell’s output matches the expected output until I run the training loop. Here’s what happens when I do:

Start fine-tuning!
batch 0 of 100, loss=1.1923661
batch 10 of 100, loss=6234.3574
batch 20 of 100, loss=23406.596
batch 30 of 100, loss=29222.49
batch 40 of 100, loss=31061.418
batch 50 of 100, loss=31496.303
batch 60 of 100, loss=31441.637
batch 70 of 100, loss=31216.28
batch 80 of 100, loss=30931.404
batch 90 of 100, loss=30625.77
Done fine-tuning!

What could be causing this? Also: is the training step function supposed to have the “model.provide_groundtruth” call inside “with tf.GradientTape() as tape”, as it is in the tutorial Colab notebook (though it isn’t mentioned in the instructions)? When I include it, I get the runaway numbers shown above. Without it (with everything else as per the instructions), I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-202-37e437a735b4> in <cell line: 3>()
     15 
     16     # Training step (forward pass + backwards pass)
---> 17     total_loss = train_step_fn(image_tensors, 
     18                                gt_boxes_list,
     19                                gt_classes_list,

1 frames
/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/func_graph.py in autograph_handler(*args, **kwargs)
   1145           except Exception as e:  # pylint:disable=broad-except
   1146             if hasattr(e, "ag_error_metadata"):
-> 1147               raise e.ag_error_metadata.to_exception(e)
   1148             else:
   1149               raise

ValueError: in user code:

    File "<ipython-input-163-f871aa37b683>", line 45, in train_step_fn  *
        losses_dict = model.loss(prediction_dict, true_shape_tensor)
    File "/usr/local/lib/python3.9/dist-packages/object_detection/meta_architectures/ssd_meta_arch.py", line 876, in loss  *
        location_losses = self._localization_loss(
    File "/usr/local/lib/python3.9/dist-packages/object_detection/core/losses.py", line 78, in __call__  *
        target_tensor = tf.where(tf.is_nan(target_tensor),

    ValueError: Shapes must be equal rank, but are 3 and 1 for '{{node Loss/Loss/Select}} = Select[T=DT_FLOAT](Loss/Loss/IsNan, concat_1, Loss/stack_2)' with input shapes: [0], [5,51150,4], [0].

I’ve checked everything else, and it matches what the instructions specify. What could be wrong here? Thank you.
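For context, my understanding is that model.provide_groundtruth only stores the target boxes/classes as state on the model, which model.loss later reads back; without the call, the loss has no targets to compare against. A toy sketch of that stateful pattern (this is NOT the real object_detection API, just an analogy with made-up names):

```python
class ToyDetector:
    """Toy stand-in for the stateful groundtruth pattern (not the real
    object_detection API): provide_groundtruth records targets on the
    model, and loss reads them back later."""

    def __init__(self):
        self._groundtruth_boxes = None

    def provide_groundtruth(self, groundtruth_boxes_list):
        # Only stores state; nothing is computed here, which is why its
        # position relative to the GradientTape does not affect gradients.
        self._groundtruth_boxes = groundtruth_boxes_list

    def loss(self, predicted_boxes):
        if self._groundtruth_boxes is None:
            # Analogous to the shape/rank failure when no targets exist.
            raise ValueError("groundtruth was never provided before loss()")
        # Trivial L1 loss between predictions and stored targets.
        return sum(abs(p - g)
                   for p, g in zip(predicted_boxes, self._groundtruth_boxes))


model = ToyDetector()
try:
    model.loss([0.5, 0.5])  # fails: no groundtruth state yet
except ValueError as e:
    print("without provide_groundtruth:", e)

model.provide_groundtruth([0.4, 0.7])
print("with provide_groundtruth, loss =", model.loss([0.5, 0.5]))
```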

Had a similar issue. My loss grew to something crazy like 9000.

What resolved the issue was inspecting the list of variables to be retrained.

Since the task is few-shot learning with only a handful of images, there are only a few variables to retrain.

It’s mentioned in the notebook. Find it and hopefully you will resolve the issue.


Yeah, that was it. Thank you.

It seems I was misled by the part of the instructions that mentions layers prefixed with “WeightSharedConvolutionalBoxPredictor/BoxPredictionTower” and “WeightSharedConvolutionalBoxPredictor/ClassPredictionTower” - you’re not supposed to include those for retraining, just the first two.
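In case it helps others: the selection boils down to a string-prefix filter over the checkpoint variable names. A minimal sketch with hypothetical names patterned on the SSD checkpoint layout (the exact prefixes to keep are spelled out in the assignment notebook, so check there):

```python
# Hypothetical variable names resembling the SSD checkpoint layout;
# the real list comes from the model's trainable_variables.
all_variable_names = [
    "WeightSharedConvolutionalBoxPredictor/WeightSharedConvolutionalBoxHead/BoxPredictor/kernel",
    "WeightSharedConvolutionalBoxPredictor/WeightSharedConvolutionalClassHead/ClassPredictor/kernel",
    "WeightSharedConvolutionalBoxPredictor/BoxPredictionTower/conv2d_0/kernel",
    "WeightSharedConvolutionalBoxPredictor/BoxPredictionTower/conv2d_1/kernel",
    "WeightSharedConvolutionalBoxPredictor/ClassPredictionTower/conv2d_0/kernel",
    "WeightSharedConvolutionalBoxPredictor/ClassPredictionTower/conv2d_1/kernel",
    "FeatureExtractor/resnet_v1_50/conv1_conv/kernel",
]

# Keep only the two prediction-head prefixes; the Tower layers stay frozen.
prefixes_to_train = (
    "WeightSharedConvolutionalBoxPredictor/WeightSharedConvolutionalBoxHead",
    "WeightSharedConvolutionalBoxPredictor/WeightSharedConvolutionalClassHead",
)

to_fine_tune = [name for name in all_variable_names
                if name.startswith(prefixes_to_train)]
print(to_fine_tune)  # only the two head variables survive the filter
```

Including the Tower prefixes as well would drag many more variables into training, which is far too much capacity for a few-shot task.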

@gent.spah

I have a similar problem: my loss gets stuck and won’t decrease. I have read all the related posts, triple-checked my checkpoints and the layers to retrain, and made sure my train_step_fn is written correctly. Still, I haven’t been able to fix the issue after days of work. Can I send my notebook file privately to you or another mentor to comment on the areas to fix? This assignment and its debugging are taking far more time than they should.

Thanks in advance for any suggestion and help.

Yes, send it privately and I’ll have a look at it…

Hi @Amirreza_Asadzadeh, one thing I picked up easily from your notebook is that you haven’t set the learning rate properly. Go back and change it as suggested and try training again, lets see what happens then…
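For intuition on why the learning rate alone can wreck training: even on a trivial quadratic, plain gradient descent diverges once the step size crosses a threshold. A minimal self-contained sketch (toy function, not the assignment’s model):

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 with fixed-step gradient descent.

    The gradient is 2*w, so each update multiplies w by (1 - 2*lr):
    |1 - 2*lr| < 1 converges, |1 - 2*lr| > 1 diverges geometrically.
    """
    for _ in range(steps):
        w -= lr * 2 * w  # w <- w - lr * f'(w)
    return w

print(gradient_descent(0.1))  # ~0.0115: shrinks toward the minimum
print(gradient_descent(1.5))  # 1048576.0: blows up, like a runaway loss
```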

Thanks @gent.spah for looking through my code. You’re right that I used 0.01 as the learning rate in my initial runs, but since that didn’t work, I tried different values for the hyperparameters (learning rate, number of batches, momentum) and even set Nesterov to True to see if it would help, but none of these changes led to an acceptable loss/accuracy. So I believe some other part of the code is problematic, and after days of debugging I still don’t know where. Would you mind looking at my code again, please?

Yeah, I will have another look at it.

In the C3W2 assignment:

prediction_dict = model.predict(true_shape_tensor, ?)

losses_dict = model.loss(prediction_dict, ?)

What did you put in place of the ‘?’ parameters above? I am stuck here; any suggestions, please?

In the lines of code and text above, an example is given for you - simply follow that example.

Ok thanks, it worked.
