Serious underfitting problem with assignment C3W2 (Zombie Detector)

Hello,

I’ve been trying to pass this assignment for weeks, but a serious underfitting problem (the loss decreases very slowly) has kept me from passing. I sent my code to one of the mentors (@gent.spah ) twice, and I double-checked with him that it is written correctly. With the parameters given in the assignment, I get the following result for fine-tuning (Exercise 10):
Start fine-tuning!
batch 0 of 1000, loss=1.8442183
batch 10 of 1000, loss=1.8161771
batch 20 of 1000, loss=1.7634711
batch 30 of 1000, loss=1.7045426
batch 40 of 1000, loss=1.6465161
batch 50 of 1000, loss=1.5921935
batch 60 of 1000, loss=1.5430243
batch 70 of 1000, loss=1.5005184
batch 80 of 1000, loss=1.466491
batch 90 of 1000, loss=1.4419613
batch 100 of 1000, loss=1.4258995

batch 950 of 1000, loss=1.297606
batch 960 of 1000, loss=1.2968262
batch 970 of 1000, loss=1.2960507
batch 980 of 1000, loss=1.295279
batch 990 of 1000, loss=1.2945113
Done fine-tuning!

As you can see, the loss doesn’t even get close to the value in the expected output section, which is around 0.0004. I’ve been experimenting with the hyperparameters for weeks and found that by increasing the learning rate to 1 for the first 10000 batches, and then to 10 for the next 10000 batches, I could reduce the loss to around 0.001, which was still not low enough to pass the assignment’s criteria, and due to Colab’s time limits, I couldn’t let it run any longer.

Now, I have completed everything in the course (all four weeks) except this one assignment in Week 2, and only it is left for me to finish the course. Considering my conversation with @gent.spah , and after double-checking against the Eager Few-Shot Object Detection Colab, I am fairly sure my code is correct; there may be some version-related issues that prevent the model from learning as expected.

At this point, I don’t know who is responsible for this assignment or whom to contact. I would like some feedback so I can pass this assignment (I can DM my Colab), or, if this issue cannot be solved by May 27th, I would like to cancel my subscription to this course. This is the 8th course I’m completing with Deeplearning.ai, and honestly, this assignment has been the most frustrating one to troubleshoot in the year I’ve been taking courses here. Considering how many similar posts there are in this forum about the same problem in this specific assignment, some serious modifications should be made to this Colab.

Thanks in advance for any help.

Hi @Amirreza_Asadzadeh I saw your code and it seemed to be alright, but it needs further thorough investigation. Maybe @Pere_Martra and/or @Wendy can have a look at it as well…

This is the most difficult lab in the specialization, by the way. If I were you, I would reset the whole lab and try again from scratch; I am pretty sure there is an error somewhere that the eye is not picking up.

I will try to have another look at it when I get more time; I also teach at a university and cannot spend a lot of time on a single question.

Hi @Amirreza_Asadzadeh , send me your notebook if you want and I will try to look for something. I remember having a hard time solving this assignment myself.

It’s hard, but you are really close to finishing the specialization!

I will try to include you in the inbox, @Pere_Martra :slight_smile:

Thanks @gent.spah for your prompt responses. I took your advice, reset my Colab, and started from scratch, but again it didn’t work. Thanks for mentioning the other mentors.

Hello @Pere_Martra , I noticed that @gent.spah included you in our inbox, where you can find my Colab. I will also send you the latest version directly.

@Amirreza_Asadzadeh, I’m also happy to take a look if you want to DM a copy to me, too.

I have included you in the inbox as well, @Wendy

@Amirreza_Asadzadeh, I did a diff between your code and mine and see a typo in yours that I suspect is the culprit:

In exercise 6.1, your code has _base_tower_layer_for_heads = ...
but it should be _base_tower_layers_for_heads, with an s at the end of layers

Try making that change and let us know how it works.

That’s interesting, one would think a misspelled variable name would cause a more obvious error (a syntax error or an undefined variable).

@TMosh, that’s true in many cases, but this happened to be a keyword-argument assignment for a checkpoint, which meant that when the checkpoint was restored it was “restoring” the wrong value. A very unlucky place to make a typo.
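For anyone wondering why the typo fails silently: checkpoint APIs like `tf.train.Checkpoint` accept arbitrary keyword arguments, so a misspelled name simply registers a new, unused attribute instead of raising an error. Here is a minimal pure-Python analogy (the `Checkpoint` and `Model` classes below are illustrative stand-ins, not the assignment's actual code):

```python
class Checkpoint:
    """Toy stand-in for a tf.train.Checkpoint-style API:
    it tracks whatever keyword arguments it is given, by name."""
    def __init__(self, **kwargs):
        for name, value in kwargs.items():
            setattr(self, name, value)

    def restore(self, saved):
        # Restore only the names this checkpoint is tracking.
        for name in list(vars(self)):
            if name in saved:
                setattr(self, name, saved[name])


class Model:
    def __init__(self):
        self._base_tower_layers_for_heads = "random-init"


model = Model()
saved = {"_base_tower_layers_for_heads": "trained-weights"}

# Typo: "layer" instead of "layers" -- no error is raised, but the
# checkpoint now tracks a brand-new name that matches nothing in `saved`.
ckpt = Checkpoint(_base_tower_layer_for_heads=model._base_tower_layers_for_heads)
ckpt.restore(saved)

# The model's actual weights were never touched:
print(model._base_tower_layers_for_heads)  # still "random-init"
```

With the correct spelling, the tracked name matches the saved key and the trained value is actually picked up; with the typo, everything "succeeds" while the model keeps its random initialization, which is exactly the slow-loss symptom in this thread.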

Thanks a lot, @Wendy ! It finally worked. I cannot believe I made the same mistake when I redid it from scratch the second time. The loss got down to the expected value within around 200 batches with the given hyperparameters, and my assignment finally passed after weeks of debugging. Definitely a very unlucky place for a typo. I’ll be much more careful when restoring model checkpoints in the future. Thank you again!

Hahaha! Oh @Wendy ! I didn’t see your message; I spent hours finding the same problem!

Yay! @Amirreza_Asadzadeh, that’s great news!

@Pere_Martra, sorry you spent so much time on it! I was hoping I’d be able to find it quickly with a diff before you dug in :frowning_face:

Well done @Wendy and @Pere_Martra , thanks a lot for your help. I would never have thought the problem was a typo.

Hi @Amirreza_Asadzadeh, I am facing the same issue. I checked that there is no typo in my code and have tried hyperparameter tuning multiple times, but the loss is still not decreasing. I’m not sure what mistake I am making; can you help me here if possible?

@jay_mangi, is the loss decreasing slowly, or not at all? If it is decreasing slowly, one possible cause is an issue with restoring the checkpoint. In @Amirreza_Asadzadeh’s case it was due to a typo, but there could be other reasons.

In any case, start by reviewing Exercises 6.2 & 6.3: Restore the checkpoint to make sure you are restoring the checkpoint properly.

Another potential cause of loss problems is an issue with predicting the bounding boxes. If you are using option 1 in the “Prepare data for training” section, try switching to option 2, at least until you get things working. Then you can go back later and get your option 1 approach working, if you want to.
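One quick, generic sanity check for the restore step (a sketch with my own naming, not part of the assignment) is to snapshot a few weight values before and after calling restore: if nothing changed, the restore likely matched no variables. In TensorFlow itself you can also inspect the status object returned by `tf.train.Checkpoint.restore`, e.g. `status.assert_existing_objects_matched()`.

```python
def restore_changed_weights(before, after):
    """Return True if at least one weight value changed after a restore.

    `before` and `after` are flat lists of floats snapshotted from the
    same model variables before and after calling restore(). If nothing
    changed, the checkpoint was probably not wired up correctly (e.g. a
    typo in a tracked name) and the model is still randomly initialized.
    """
    return any(b != a for b, a in zip(before, after))


# Hypothetical usage: snapshot, restore the checkpoint, snapshot again.
before = [0.13, -0.52, 0.08]   # values at random initialization
after = [0.97, -1.21, 0.44]    # values after restoring the checkpoint
print(restore_changed_weights(before, after))  # True: the restore took effect
```

If this check returns False after a restore, the slow-loss symptom in this thread is a likely outcome, since fine-tuning would then start from random weights.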

I have tried multiple rounds of hyperparameter tuning, but the loss only gets down to 0.62 and won’t reduce further. The exercise says that less than 1 should be good, but then in Exercise 11 I am not getting predictions close to the tagged boxes (I used option 1).

@jay_mangi,
As a next step, try option 2 and see if you can get it working with that.

I have already tried with option 2.

Hmm. OK, then. The most common hard-to-find cause of the loss staying too large is some error in restoring the checkpoint - typos, accidentally using the wrong variable, etc. But if you’ve already reviewed Exercises 6.2 & 6.3 carefully and are still stuck, feel free to DM me a copy of your .ipynb and I’ll take a look to see if I notice anything.