Zombie Detector (C3W2) - Training Step Loss Going Crazy

I’ve been working on the Zombie Detector assignment, and every cell’s output matches the expected output until I run the training loop. Here’s what happens when I do:

Start fine-tuning!
batch 0 of 100, loss=1.1923661
batch 10 of 100, loss=6234.3574
batch 20 of 100, loss=23406.596
batch 30 of 100, loss=29222.49
batch 40 of 100, loss=31061.418
batch 50 of 100, loss=31496.303
batch 60 of 100, loss=31441.637
batch 70 of 100, loss=31216.28
batch 80 of 100, loss=30931.404
batch 90 of 100, loss=30625.77
Done fine-tuning!

What could be causing this? Also: is the training step function supposed to have the “model.provide_groundtruth” call inside “with tf.GradientTape() as tape”, as it is in the tutorial Colab notebook (though it isn’t mentioned in the instructions)? When I include it, I get the runaway numbers shown above. Without it (with everything else as per the instructions), I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-202-37e437a735b4> in <cell line: 3>()
     15 
     16     # Training step (forward pass + backwards pass)
---> 17     total_loss = train_step_fn(image_tensors, 
     18                                gt_boxes_list,
     19                                gt_classes_list,

1 frames
/usr/local/lib/python3.9/dist-packages/tensorflow/python/framework/func_graph.py in autograph_handler(*args, **kwargs)
   1145           except Exception as e:  # pylint:disable=broad-except
   1146             if hasattr(e, "ag_error_metadata"):
-> 1147               raise e.ag_error_metadata.to_exception(e)
   1148             else:
   1149               raise

ValueError: in user code:

    File "<ipython-input-163-f871aa37b683>", line 45, in train_step_fn  *
        losses_dict = model.loss(prediction_dict, true_shape_tensor)
    File "/usr/local/lib/python3.9/dist-packages/object_detection/meta_architectures/ssd_meta_arch.py", line 876, in loss  *
        location_losses = self._localization_loss(
    File "/usr/local/lib/python3.9/dist-packages/object_detection/core/losses.py", line 78, in __call__  *
        target_tensor = tf.where(tf.is_nan(target_tensor),

    ValueError: Shapes must be equal rank, but are 3 and 1 for '{{node Loss/Loss/Select}} = Select[T=DT_FLOAT](Loss/Loss/IsNan, concat_1, Loss/stack_2)' with input shapes: [0], [5,51150,4], [0].

I’ve checked everything else, and it matches what the instructions specify. What could be wrong here? Thank you.
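For context, my understanding is that model.provide_groundtruth only stores the target boxes/classes as state on the model, which model.loss later reads back; without the call, the loss has no targets to compare against. A toy sketch of that stateful pattern (this is NOT the real object_detection API, just an analogy with made-up names):

```python
class ToyDetector:
    """Toy stand-in for the stateful groundtruth pattern (not the real
    object_detection API): provide_groundtruth records targets on the
    model, and loss reads them back later."""

    def __init__(self):
        self._groundtruth_boxes = None

    def provide_groundtruth(self, groundtruth_boxes_list):
        # Only stores state; nothing is computed here, which is why its
        # position relative to the GradientTape does not affect gradients.
        self._groundtruth_boxes = groundtruth_boxes_list

    def loss(self, predicted_boxes):
        if self._groundtruth_boxes is None:
            # Analogous to the shape/rank failure when no targets exist.
            raise ValueError("groundtruth was never provided before loss()")
        # Trivial L1 loss between predictions and stored targets.
        return sum(abs(p - g)
                   for p, g in zip(predicted_boxes, self._groundtruth_boxes))


model = ToyDetector()
try:
    model.loss([0.5, 0.5])  # fails: no groundtruth state yet
except ValueError as e:
    print("without provide_groundtruth:", e)

model.provide_groundtruth([0.4, 0.7])
print("with provide_groundtruth, loss =", model.loss([0.5, 0.5]))
```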

Had a similar issue. My loss grew to something crazy like 9000.

What resolved the issue was inspecting the list of variables to be retrained.

Since the task is few-shot learning with only a handful of images, there are only a few variables to retrain.

It’s mentioned in the notebook. Find it and hopefully you will resolve the issue.


Yeah, that was it. Thank you.

It seems I was misled by the part of the instructions that mentions layers prefixed with “WeightSharedConvolutionalBoxPredictor/BoxPredictionTower” and “WeightSharedConvolutionalBoxPredictor/ClassPredictionTower” - you’re not supposed to include those for retraining, just the first two.
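In case it helps others: the selection boils down to a string-prefix filter over the checkpoint variable names. A minimal sketch with hypothetical names patterned on the SSD checkpoint layout (the exact prefixes to keep are spelled out in the assignment notebook, so check there):

```python
# Hypothetical variable names resembling the SSD checkpoint layout;
# the real list comes from the model's trainable_variables.
all_variable_names = [
    "WeightSharedConvolutionalBoxPredictor/WeightSharedConvolutionalBoxHead/BoxPredictor/kernel",
    "WeightSharedConvolutionalBoxPredictor/WeightSharedConvolutionalClassHead/ClassPredictor/kernel",
    "WeightSharedConvolutionalBoxPredictor/BoxPredictionTower/conv2d_0/kernel",
    "WeightSharedConvolutionalBoxPredictor/BoxPredictionTower/conv2d_1/kernel",
    "WeightSharedConvolutionalBoxPredictor/ClassPredictionTower/conv2d_0/kernel",
    "WeightSharedConvolutionalBoxPredictor/ClassPredictionTower/conv2d_1/kernel",
    "FeatureExtractor/resnet_v1_50/conv1_conv/kernel",
]

# Keep only the two prediction-head prefixes; the Tower layers stay frozen.
prefixes_to_train = (
    "WeightSharedConvolutionalBoxPredictor/WeightSharedConvolutionalBoxHead",
    "WeightSharedConvolutionalBoxPredictor/WeightSharedConvolutionalClassHead",
)

to_fine_tune = [name for name in all_variable_names
                if name.startswith(prefixes_to_train)]
print(to_fine_tune)  # only the two head variables survive the filter
```

Including the Tower prefixes as well would drag many more variables into training, which is far too much capacity for a few-shot task.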

@gent.spah

I have a similar problem: my loss gets stuck and won’t decrease. I have read all the related posts, triple-checked my checkpoints and the layers to retrain, and made sure my train_step_fn is written correctly. Still, I haven’t been able to fix the issue after days of work. Can I send my notebook file privately to you or another mentor to comment on the areas to fix? This assignment and its debugging are taking far more time than they should.

Thanks in advance for any suggestion and help.

Yes, send it privately and I’ll have a look at it…

Hi @Amirreza_Asadzadeh, one thing I picked up easily from your notebook is that you haven’t set the learning rate properly. Go back and change it as suggested and try training again, lets see what happens then…
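For intuition on why the learning rate alone can wreck training: even on a trivial quadratic, plain gradient descent diverges once the step size crosses a threshold. A minimal self-contained sketch (toy function, not the assignment’s model):

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Minimize f(w) = w**2 with fixed-step gradient descent.

    The gradient is 2*w, so each update multiplies w by (1 - 2*lr):
    |1 - 2*lr| < 1 converges, |1 - 2*lr| > 1 diverges geometrically.
    """
    for _ in range(steps):
        w -= lr * 2 * w  # w <- w - lr * f'(w)
    return w

print(gradient_descent(0.1))  # ~0.0115: shrinks toward the minimum
print(gradient_descent(1.5))  # 1048576.0: blows up, like a runaway loss
```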

Thanks @gent.spah for looking through my code. You’re right that I used 0.01 as the learning rate in my initial runs, but since that didn’t work, I tried different values for the hyperparameters (learning rate, number of batches, momentum) and even set Nesterov to True to see if it would help, but none of these changes led to an acceptable loss/accuracy. So I believe some other part of the code is problematic, and after days of debugging I still don’t know where. Would you mind looking at my code again, please?

Yeah, I will have another look at it.

In the C3W2 assignment:

prediction_dict = model.predict(true_shape_tensor, ?)

losses_dict = model.loss(prediction_dict, ?)

What did you put in place of the ‘?’ parameters above? I am stuck here; any suggestions, please?

In the lines of code and text above, an example is given for you - simply follow that example.

Ok thanks, it worked.
