Huge underfitting on week 2 assignment

Dear colleagues,

I am seeing severe underfitting when training the zombie detection model. Here is my loss decay with a learning rate of 0.01:

Start fine-tuning!
batch 0 of 100, loss=1.8433523
batch 10 of 100, loss=1.8154047
batch 20 of 100, loss=1.7628807
batch 30 of 100, loss=1.7041743
batch 40 of 100, loss=1.646369
batch 50 of 100, loss=1.5922079
batch 60 of 100, loss=1.5430906
batch 70 of 100, loss=1.5004694
batch 80 of 100, loss=1.4661503
batch 90 of 100, loss=1.4412313
Done fine-tuning!

By drastically increasing the learning rate to 1 and the number of batches to around 30k, I could get losses around 0.005, but that was still not enough to pass the assignment. Then my available time on Colab ran out. I am really stuck now. Can anyone please help me?
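For intuition on how the learning rate drives that slow loss decay, here is a minimal gradient-descent sketch on a toy quadratic. This is plain Python, not the assignment's TensorFlow model; the function and the rates are made up purely for illustration:

```python
def gradient_descent(lr, steps, w0=0.0):
    """Minimize f(w) = (w - 3)**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)   # df/dw
        w = w - lr * grad
    return w

# A tiny learning rate barely moves the weight (loss decays very slowly,
# which looks like underfitting):
w_small = gradient_descent(lr=0.001, steps=100)
# A moderate learning rate converges to the minimum at w = 3:
w_good = gradient_descent(lr=0.1, steps=100)
# A learning rate >= 1 overshoots: each step flips w past the minimum.
w_big = gradient_descent(lr=1.1, steps=100)
```

With the update w ← w − lr·2(w−3), the distance to the minimum is multiplied by |1 − 2·lr| each step, so any lr ≥ 1 makes the iterate oscillate or diverge instead of settling. That is one reason a learning rate of 1 is too aggressive even if the printed loss briefly looks small.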


This is a complex assignment with many steps; most likely something in an earlier step is not set up as it is supposed to be. Search the forum first to see if you can find anything helpful about this assignment. Otherwise, check all the previous steps carefully.

@gent.spah @Luis_Filipe

I am encountering the exact same problem. It seems the algorithm does not learn when the learning rate is set to a reasonable value, as suggested by the assignment. If I run the experiment of increasing it dramatically to 1, I make the same observations as @Luis_Filipe .

I have gone through the assignment from top to bottom for weeks and I simply cannot figure out where the error, if any, could be. Did you spot it @Luis_Filipe ?

A learning rate of 1 is actually too big. I would suggest paying special attention to the restoration of the model checkpoint and the related settings: how the checkpoint is restored and how the output layers are isolated.
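To illustrate why the checkpoint-restoration step matters, here is a framework-free sketch of the idea of a partial restore. Plain dictionaries stand in for the real tf.train.Checkpoint objects, and all layer names and values are made up for illustration:

```python
# A pre-trained checkpoint: weights for every layer of the saved model.
checkpoint = {
    "feature_extractor": [0.5, -0.2, 0.9],
    "box_head": [0.1, 0.4],
    "class_head": [0.7, -0.3],  # trained for the *original* classes
}

# Freshly built model: every layer starts at a new initialization.
model = {
    "feature_extractor": [0.0, 0.0, 0.0],
    "box_head": [0.0, 0.0],
    "class_head": [9.9, 9.9],  # new head for the zombie class
}

# Partial restore: copy only the layers we want to reuse, and leave the
# new classification head at its fresh initialization so it can be trained.
layers_to_restore = ["feature_extractor", "box_head"]
for name in layers_to_restore:
    model[name] = checkpoint[name]
```

Restoring the wrong subset of layers (for example overwriting the new head, or forgetting to restore the feature extractor) leaves the network in a state where the loss barely moves, which matches the symptoms reported above.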

I have also seen some learners have problems with the train_step_fn() function due to not following the instructions closely.
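For reference, a train step typically follows the same four-stage pattern in any framework: forward pass, loss, gradient, update. Here is that skeleton in plain Python with a hand-coded gradient for a 1-D linear model; the assignment's train_step_fn uses tf.GradientTape on the detection model, so this is only a structural sketch with made-up data:

```python
def make_train_step_fn(lr):
    """Return a train-step closure: forward pass, loss, gradient, update."""
    def train_step_fn(w, x, y):
        # 1) Forward pass: prediction of a 1-D linear model y_hat = w * x.
        y_hat = [w * xi for xi in x]
        # 2) Loss: mean squared error.
        loss = sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(x)
        # 3) Gradient of the loss w.r.t. w (computed by hand here;
        #    tf.GradientTape automates this step in the assignment).
        grad = sum(2 * (p - t) * xi for p, t, xi in zip(y_hat, y, x)) / len(x)
        # 4) Apply the update.
        w = w - lr * grad
        return w, loss
    return train_step_fn

# Toy data where the true weight is 2.0.
x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
step = make_train_step_fn(lr=0.05)
w = 0.0
for _ in range(200):
    w, loss = step(w, x, y)
```

Skipping or reordering any of these stages, or taking the gradient with respect to the wrong variables, tends to produce exactly the flat, barely decreasing loss described in this thread.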

I would suggest having a look at these.

Have you fixed the problem?

I had similar symptoms, with the loss at about 1.3 to 1.8 and not converging. It turned out to be a misspelling.

I got it working by starting over and re-entering all the exercise code, this time copy-pasting (e.g. from the lecture notes and hints) wherever possible to avoid misspelling any long variable names.

Then I used a Colab diff between the bad version and the good version to understand what had happened, and sure enough I had a misspelling:

Good (from the lecture notes): (screenshot of the corrected line, not preserved here)
I made the edit, ran Runtime -> Run All, and the loss converged as expected.