Exercise 10: Define the training step - huge training loss

Hello,

I have been stuck on Exercise 10 for a long time. I think the mistake is in my train_step_fn, but I can’t figure out what the issue is. Here is what I’m doing:

  • loop through image_list and, for each img, run model.preprocess(img) to get the preprocessed image and its true shape (each appended to its own list)

  • tf.concat the preprocessed image list and the true shape list, giving one tensor for each

  • run model.predict on the two concatenated tensors

  • run model.provide_groundtruth with groundtruth_boxes_list and groundtruth_classes_list as arguments

  • run model.loss(prediction_dict, true_shape_tensor) to get the losses dict

  • sum up the localization loss and the classification loss from the losses dict

  • finally, use the total loss to compute the gradients and apply them with the optimizer

Is there anything missing from the list above, or anything I’m doing wrong?

I have also seen other topics about this issue. They suggest checking Exercises 6.2 and 6.3. I did that but found no problem: my outputs match the expected ones. I also checked that I downloaded the right RetinaNet checkpoint, which was the solution in another topic, and found no issue there either.

If anyone knows something that could help, please let me know.

Thanks in advance!

Hello,

Your description of the steps seems OK. If you haven’t resolved the issue yet, I could have a look at your notebook via private message.
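For comparison, the steps listed in the question correspond roughly to the sketch below. It is only a minimal illustration, not the assignment's reference solution: the function name train_step_sketch is made up, the model is assumed to expose preprocess, predict, provide_groundtruth and loss as described above, and the two loss keys are assumed to be the ones an SSD-style detection model returns. The points worth comparing against your own code are that the forward pass and the loss are computed inside the tf.GradientTape block, and that the gradients are taken with respect to the same variable list that is passed to the optimizer.

```python
import tensorflow as tf


def train_step_sketch(image_list, groundtruth_boxes_list,
                      groundtruth_classes_list, model, optimizer,
                      vars_to_fine_tune):
    """Runs one fine-tuning step and returns the scalar total loss."""
    # Attach the labels to the model; this must happen before model.loss().
    model.provide_groundtruth(
        groundtruth_boxes_list=groundtruth_boxes_list,
        groundtruth_classes_list=groundtruth_classes_list)

    with tf.GradientTape() as tape:
        # Preprocess each image and collect the results in two lists.
        preprocessed, true_shapes = [], []
        for img in image_list:
            processed_img, true_shape = model.preprocess(img)
            preprocessed.append(processed_img)
            true_shapes.append(true_shape)

        # Merge each list into a single batch tensor.
        image_tensor = tf.concat(preprocessed, axis=0)
        shape_tensor = tf.concat(true_shapes, axis=0)

        # Forward pass and loss, both recorded by the tape.
        prediction_dict = model.predict(image_tensor, shape_tensor)
        losses_dict = model.loss(prediction_dict, shape_tensor)

        # Assumed key names; print losses_dict.keys() to confirm them.
        total_loss = (losses_dict["Loss/localization_loss"]
                      + losses_dict["Loss/classification_loss"])

    # Differentiate and update only the variables being fine-tuned.
    gradients = tape.gradient(total_loss, vars_to_fine_tune)
    optimizer.apply_gradients(zip(gradients, vars_to_fine_tune))
    return total_loss
```

If tape.gradient returns None for some variables, those variables did not take part in the forward pass recorded by the tape, which is worth checking before looking elsewhere.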

Hi,

I had a quick look at your notebook on my phone and couldn’t find any issues at first glance.

The only thing I would suggest for now is to make sure you are defining and restoring the checkpoint correctly and selecting the right layers to fine-tune. It is better to select those layers by index rather than by name.

I would suggest going over these once more, and let’s see what happens.

One additional hint for anyone who is still stuck: when checking that the checkpoints are defined correctly in Exercises 6.1 and 6.2, carefully check the spelling of the parameter names passed to tf.train.Checkpoint to make sure you don’t have a typo.
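To illustrate why the spelling matters, here is a small self-contained sketch. tf.train.Checkpoint accepts arbitrary keyword names, so a misspelled name does not raise an error when the checkpoint is built or restored; the variable simply keeps its initial value. The keyword names weights and wieghts and the helper functions are made up for this demonstration and are not the names used in the assignment.

```python
import os
import tempfile

import tensorflow as tf


def restore_with_keyword(ckpt_path, keyword):
    """Restores into a fresh variable tracked under `keyword`.

    Returns (restored_value, matched), where matched is False when the
    variable could not be paired with anything saved in the checkpoint.
    """
    target = tf.Variable(0.0)
    ckpt = tf.train.Checkpoint(**{keyword: target})
    status = ckpt.restore(ckpt_path)  # does not raise on a name mismatch
    try:
        # Raises AssertionError if a tracked object found no saved value.
        status.assert_existing_objects_matched()
        matched = True
    except AssertionError:
        matched = False
    return float(target.numpy()), matched


def demo():
    """Saves under the name 'weights', then restores with and without a typo."""
    with tempfile.TemporaryDirectory() as tmp:
        saved = tf.Variable(5.0)
        path = tf.train.Checkpoint(weights=saved).save(os.path.join(tmp, "ckpt"))
        good = restore_with_keyword(path, "weights")  # correct spelling
        bad = restore_with_keyword(path, "wieghts")   # deliberate typo
    return good, bad
```

With the correct name the variable picks up the saved value; with the typo it stays at 0.0 and only the explicit assert reveals the mismatch. In the exercise, an unrestored layer would keep its random initial weights, which is one plausible way to end up with a very large loss.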

I tried all of the suggestions above but am still getting a loss above 1.