SRGAN: Doubt in Loss Functions

Hey,
I am having some small issues with the loss functions that we are using in SRGAN.

In the theory part for the loss functions, specifically the content loss, they consider the output of the 4th convolutional layer before the 5th max-pooling layer of the VGG19 network to compute the content loss. But in the code, I can see that we are using the final output of the VGG19 network, rather than the output of that intermediate layer, to compute the content loss. Is the theory mentioned just for reference, or am I missing something in the code for the content loss?
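For context, here is a minimal sketch of what an intermediate-layer content loss could look like, assuming a torchvision VGG19 and that the target feature map is the one just before the 5th max-pool; the slicing index and the use of MSE here are my assumptions, not necessarily what the course code does:

```python
import torch.nn as nn
from torchvision.models import vgg19

class VGGContentLoss(nn.Module):
    """Content loss computed on an intermediate VGG19 feature map
    (roughly the activation after the 4th conv of the 5th block),
    instead of the network's final output."""
    def __init__(self):
        super().__init__()
        # vgg19().features is an nn.Sequential; index 36 is the 5th max-pool,
        # so slicing [:36] stops right before it (just after the conv5_4 ReLU).
        features = vgg19(pretrained=True).features[:36]
        for p in features.parameters():
            p.requires_grad = False  # VGG is used as a fixed feature extractor
        self.features = features.eval()

    def forward(self, sr, hr):
        # Compare super-resolved and high-resolution images in feature space
        return nn.functional.mse_loss(self.features(sr), self.features(hr))
```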

One more doubt (I may have gone mad here): in the adversarial loss, while defining the target, we have:

if is_real: target = torch.zeros_like(x)
else: target = torch.ones_like(x)

However, when we compute the loss for real images, the target vector should be all ones, and when we compute the loss for fake images, the target vector should be all zeros (from the discriminator's perspective). But if we consider the provided code and the formulation of d_loss, it's the complete opposite: for real images the target is all zeros, and for fake images the target is all ones.

Am I missing something very important here, or is the code provided for computing adv_loss, g_loss, and d_loss incorrect?

Regards,
Elemento

Hey @mentor,
Can you please see to this?

Good catch, @Elemento! That adv_loss target definition looks backwards to me too, and the places that call it look mixed up as well. The generator's call also seems backwards, which happens to counteract the backwards target definition in adv_loss, but the discriminator's call isn't backwards. That means the generator and discriminator are both working towards the same target, which isn't very adversarial:

g_loss calls: self.adv_loss(fake_preds_for_g, False)
d_loss calls: self.adv_loss(fake_preds_for_d, False)
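For reference, a rough sketch of the usual orientation of the targets and calls in a standard GAN setup. This assumes a binary cross-entropy-with-logits criterion and illustrative names like real_preds and fake_preds_for_g; the actual course code may differ in its details:

```python
import torch
import torch.nn.functional as F

def adv_loss(preds, is_real):
    # Standard convention: real examples are pushed toward 1, fakes toward 0
    target = torch.ones_like(preds) if is_real else torch.zeros_like(preds)
    return F.binary_cross_entropy_with_logits(preds, target)

# Discriminator: label real predictions as real and fake predictions as fake
# d_loss = adv_loss(real_preds, True) + adv_loss(fake_preds_for_d, False)

# Generator: tries to make the discriminator call its fakes real
# g_adv_loss = adv_loss(fake_preds_for_g, True)
```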

I’ll submit a bug report to have the developers look at fixing this.
I’ll also ask them about your first point; I agree that the implementation looks inconsistent with the theory explanation.


Hey @Wendy,
Thanks a lot!

Hey @Wendy,
I modified the loss function to the way I think it should be, and I trained my SRGAN with that, but in the middle of training the generator stopped training. Can you please help me find the mistake I am making in my formulation? I have attached screenshots for your reference.

In the first image, you can see my formulation. In the second image, you can see that the generator was training as expected up to 57,000 steps, and then it suddenly stopped training. And I guess when the generator is not being trained, the discriminator also won't be trained, as can be seen in the third image.

Regards,
Elemento

Hmm. That’s odd. It looks like the place you start seeing the problem is train_srresnet, which doesn’t create a Loss object, so it doesn’t use the code you changed in your screenshot (forward and adv_loss); it’s train_srgan that uses those. train_srresnet only uses the static method Loss.img_loss, which you don’t appear to have changed.

Did you change anything else besides what you show in the screenshot? If not, the only thing I can think to suggest is to start fresh to make sure you have a totally clean environment.

Also, if you’re specifically trying to test your changes to adv_loss and forward, I’d suggest saving yourself some time by essentially skipping the SRResNet portion of the training: call train_srresnet with a relatively small number of steps, like 2000:
train_srresnet(generator, dataloader, device, lr=1e-4, total_steps=2000, display_step=1000)
Then you can focus on the results from train_srgan which uses your changes.


Hey @Wendy,
I only changed the loss functions, which can be seen in the first image; everything else is the same as before. If you want, you can share your Kaggle username with me, and I can give you access to my notebook on Kaggle so you can check for yourself.

And yes, I will surely try checking my changes to the loss functions, as you suggested in your reply.

OK, since the code you changed shouldn’t have been run yet at the point where you hit the issue, it shouldn’t be the cause. Also, I know I was able to run train_srresnet() for the full 100,000 steps without any issue (using Colab).

So it seems like something in the state of your environment was different from mine. That’s why I suggested starting a clean run, with everything reset. I didn’t realize at the time that you were using Kaggle, so that’s another difference. I don’t see anything obvious that would cause the issue, but out of curiosity, you might add a print statement after the line that initializes has_autocast to print which version of PyTorch you’re using. In Colab, I’m using 1.10.
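For example, something as simple as this right after the has_autocast line (just a quick sanity check, nothing course-specific):

```python
import torch
print(torch.__version__)  # e.g. "1.10.x" on Colab vs. "1.9.x" on Kaggle
```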

Hey @Wendy,
The version Kaggle has is 1.9. I am assuming this won’t be the issue then, since autocast is being used in my code as well?

Additionally, if possible, do let me know how to run a notebook on Colab without being active. For instance, when I run a notebook on Colab, I have to keep interacting with it at regular intervals, otherwise the kernel gets disconnected. Is there a way around this, so that I can leave my notebooks running overnight?

Regards,
Elemento

Hi @Elemento,
Unfortunately, I don’t have any tips for running on Colab without being active. I actually thought it was brilliant that you chose Kaggle for exactly that reason, that you can leave it running. :brain:


Oh, I am glad you liked it @Wendy :laughing:. By the way, what about the Kaggle version? Do you think it could be the issue we are looking for?

I doubt the Kaggle version is the issue. Did you ever try just re-running it after a clean start to see if it’s repeatable? I know that will take a while to run, and I’m not sure it’s worth spending much time on this, to be honest, since train_srresnet() doesn’t seem to add much improvement as far as I can tell (maybe because of the first issue you identified). So it should be fine to just cut down on the number of steps for that one, and then you can see how your change affects train_srgan.

Hey @Wendy,
Actually, my GPU quota on Kaggle ran out, so I was waiting for next week. Nonetheless, today I ran it on the CPU and will let you know the outcome once it finishes :innocent:

Hey @Wendy,
I ran the notebook on the CPU with fewer epochs, and both functions ran smoothly as expected. I think this is enough evidence that my changes to the loss functions were correct. The reason the training is collapsing in train_srresnet, and eventually in train_srgan, must be something else, I guess?

Regards,
Elemento

Great! You were brave to try running this on the CPU. It must have been incredibly slow!

Your changes definitely didn’t cause the problem, since the problem happened before the code with your changes was even run.

Cool then, I will wait for the notebook to be updated, and then I will run the notebook once again.

Thanks a lot for your time @Wendy :blush: