C3_W3_loading pre-trained models

In week 3 of course 3, I loaded the pre-trained models, including the optimizers and weights, into my CycleGAN. Yet when I resume training, the losses start with:

Epoch 0: Step 0: Generator (U-Net) loss: 0.01129252791404724, Discriminator loss: 0.0007045255601406098

But they quickly get much worse, and then they begin to improve slowly.

Epoch 0: Step 200: Generator (U-Net) loss: 2.754218228459359, Discriminator loss: 0.2143540270254018

Epoch 0: Step 400: Generator (U-Net) loss: 2.6026798564195643, Discriminator loss: 0.2042489836178719

It seems the weights are kept but the optimizers are restarted.
Is this normal in PyTorch, or have I made a mistake?

Hi @Ali_sabzi,
Sorry for the delay! Looks like we somehow overlooked this question.

Very observant of you! What you’re seeing is due to an anomaly (bug) in the code for the train() function. You can see in these lines that the code is calculating an average loss, assuming that there will be display_step losses calculated before displaying:

            # Keep track of the average discriminator loss
            mean_discriminator_loss += disc_A_loss.item() / display_step
            # Keep track of the average generator loss
            mean_generator_loss += gen_loss.item() / display_step

But, since we initialize cur_step to 0, that means on the very first time through the loop, this check will be true:

            ### Visualization code ###
            if cur_step % display_step == 0:

In this initial case, our assumption that we are averaging display_step losses is wrong. We have only calculated one loss so far, so we shouldn’t have divided by display_step to get our average loss. That means the actual loss for Epoch 0, Step 0 is 200 times larger than what is printed (since what we printed is the loss divided by display_step, and display_step is 200).
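As a quick sanity check, the printed Step 0 values from the question can be scaled back up by display_step. This is only a sketch of the arithmetic; the variable names below are made up for illustration and are not from the notebook.

```python
# display_step is 200, as stated in the thread.
display_step = 200

# Values printed for "Epoch 0: Step 0" in the question.
printed_gen_loss = 0.01129252791404724
printed_disc_loss = 0.0007045255601406098

# Only one loss had been accumulated at step 0, so the printed value is
# (single loss) / display_step. Multiplying undoes that division.
# Note: these variable names are hypothetical, not from the course code.
actual_gen_loss = printed_gen_loss * display_step
actual_disc_loss = printed_disc_loss * display_step

print(f"Generator: {actual_gen_loss:.3f}, Discriminator: {actual_disc_loss:.3f}")
```

The rescaled values come out at roughly 2.26 and 0.14, which is in the same range as the losses reported at steps 200 and 400.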

One easy way to fix the bug is to initialize cur_step = 1 instead of cur_step = 0. This has the disadvantage that you won’t see anything displayed until after the first 200 times through the inner loop, and it’s nice to see something displayed right away. Fixing the code to display the first time through AND display the right loss values for that first time would take a little more work. I’ll report this bug to the developers, but it may be low priority for them to fix.
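For anyone who wants both behaviors, one possible approach is to count how many losses have actually been accumulated since the last display and divide by that count. The sketch below simulates only the display logic on a plain list of numbers; run_training, running_sum and num_accumulated are hypothetical names, and this is not the course's train() function.

```python
def run_training(losses, display_step=200):
    """Simulate the display logic on a list of per-step loss values.

    Hypothetical helper for illustration; not the notebook's train().
    Returns a list of (step, mean_loss) pairs, one per display.
    """
    reports = []
    running_sum = 0.0      # sum of losses since the last display
    num_accumulated = 0    # how many losses are in running_sum
    for cur_step, loss in enumerate(losses):
        running_sum += loss
        num_accumulated += 1
        if cur_step % display_step == 0:
            # Divide by the real count, which is 1 at step 0
            # and display_step at every later display.
            reports.append((cur_step, running_sum / num_accumulated))
            running_sum = 0.0
            num_accumulated = 0
    return reports


# With a constant loss of 2.0, every reported mean should also be 2.0,
# including the one at step 0.
reports = run_training([2.0] * 401, display_step=200)
print(reports)
```

The same idea should carry over to the real loop by keeping one counter alongside mean_generator_loss and mean_discriminator_loss, though I have not tested that against the assignment code.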

Hi @Wendy,
I see, so it was an error in the printed text, and the training was resumed correctly. Thank you for your answer.
I trained the network for 1000 steps, and the augmented images displayed for Epoch 0: Step 0 of my run looked better than the rest of the displayed images, so I assumed the losses were genuine, but it seems that was a coincidence.