SRGAN: Doubt in Loss Functions

Hey,
I am having some small issues with the loss functions that we are using in SRGAN.

In the theory part for the loss functions, specifically the content loss, they consider the output of the 4th convolutional layer before the 5th max-pooling layer of the VGG19 network to compute the content loss. But in the code, I can see that we are using the final output of the VGG19 network, rather than the output of that intermediate layer, to compute the content loss. Is the theory mentioned just for reference, or am I missing something in the code for the content loss?
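For context, here is a minimal sketch of what an intermediate-layer content loss could look like, assuming a torchvision VGG19 and that the target feature map is the one just before the 5th max-pool; the slicing index and the use of MSE here are my assumptions, not necessarily what the course code does:

```python
import torch.nn as nn
from torchvision.models import vgg19

class VGGContentLoss(nn.Module):
    """Content loss computed on an intermediate VGG19 feature map
    (roughly the activation after the 4th conv of the 5th block),
    instead of the network's final output."""
    def __init__(self):
        super().__init__()
        # vgg19().features is an nn.Sequential; index 36 is the 5th max-pool,
        # so slicing [:36] stops right before it (just after the conv5_4 ReLU).
        features = vgg19(pretrained=True).features[:36]
        for p in features.parameters():
            p.requires_grad = False  # VGG is used as a fixed feature extractor
        self.features = features.eval()

    def forward(self, sr, hr):
        # Compare super-resolved and high-resolution images in feature space
        return nn.functional.mse_loss(self.features(sr), self.features(hr))
```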

One more doubt (I may have gone mad here): in the adversarial loss, while defining the target, we have:

if is_real: target = torch.zeros_like(x)
else: target = torch.ones_like(x)

However, when we compute the loss for real images, the target vector should be all ones, and when we compute the loss for fake images, the target vector should be all zeros (from the discriminator's perspective). But if we consider the provided code and the formulation of d_loss, it's the complete opposite: for real images the target is all zeros, and for fake images the target is all ones.

Am I missing something very important here, or is the code provided for computing adv_loss, g_loss, and d_loss incorrect?

Regards,
Elemento

Hey @mentor,
Can you please see to this?

Good catch, @Elemento! That adv_loss target definition looks backwards to me too, and the places that call it look mixed up as well. The generator's call also seems backwards, which happens to counteract the backwards target definition in adv_loss, but the discriminator's call isn't backwards. That means the generator and discriminator are both working towards the same target, which isn't very adversarial:

g_loss calls: self.adv_loss(fake_preds_for_g, False)
d_loss calls: self.adv_loss(fake_preds_for_d, False)
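For reference, a rough sketch of the usual orientation of the targets and calls in a standard GAN setup. This assumes a binary cross-entropy-with-logits criterion and illustrative names like real_preds and fake_preds_for_g; the actual course code may differ in its details:

```python
import torch
import torch.nn.functional as F

def adv_loss(preds, is_real):
    # Standard convention: real examples are pushed toward 1, fakes toward 0
    target = torch.ones_like(preds) if is_real else torch.zeros_like(preds)
    return F.binary_cross_entropy_with_logits(preds, target)

# Discriminator: label real predictions as real and fake predictions as fake
# d_loss = adv_loss(real_preds, True) + adv_loss(fake_preds_for_d, False)

# Generator: tries to make the discriminator call its fakes real
# g_adv_loss = adv_loss(fake_preds_for_g, True)
```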

I’ll submit a bug report to have the developers look at fixing this.
I’ll also ask them about your first point; I agree that the implementation looks inconsistent with the theory explanation.


Hey @Wendy,
Thanks a lot!

Hey @Wendy,
I modified the loss function to the way I think it should be, and I trained my SRGAN with that, but in the middle of training the generator stopped training. Can you please help me find the mistake I am making in my formulation? I have attached screenshots for your reference.

In the first image, you can see my formulation. In the second image, you can see that the generator was training as expected up to 57,000 steps, and then it suddenly stopped training. And I guess when the generator is not being trained, the discriminator also won't be trained, as can be seen in the third image.

Regards,
Elemento

Hmm. That’s odd. It looks like the place you start seeing the problem is train_srresnet, which doesn’t create a Loss object, so it doesn’t use the code you changed in your screenshot (forward and adv_loss); it’s train_srgan that uses those. train_srresnet only uses the static method Loss.img_loss, which you don’t appear to have changed.

Did you change anything else besides what you show in the screenshot? If not, the only thing I can think to suggest is to start fresh to make sure you have a totally clean environment.

Also, if you’re specifically trying to test your changes to adv_loss and forward, I’d suggest saving yourself some time by essentially skipping the SRResNet portion of the training: call train_srresnet with a relatively small number of steps, like 2000:
train_srresnet(generator, dataloader, device, lr=1e-4, total_steps=2000, display_step=1000)
Then you can focus on the results from train_srgan which uses your changes.


Hey @Wendy,
I only changed the loss functions, which can be seen in the first image; everything else is the same as before. If you want, you can share your Kaggle username with me, and I can give you access to my notebook on Kaggle so you can check for yourself.

And yes, I will surely try checking my changes to the loss functions, as you suggested in your reply.

OK, since the code you changed shouldn’t have been run yet at the point where you hit the issue, it shouldn’t be the cause. Also, I know I was able to run train_srresnet() for the full 100,000 steps without any issue (using Colab).

So it seems like something in the state of your environment was different from mine. That’s why I suggested starting a clean run, with everything reset. I didn’t realize at the time that you were using Kaggle, so that’s another difference. I don’t see anything obvious that would cause the issue, but out of curiosity, you might add a print statement after the line that initializes has_autocast to print which version of PyTorch you’re using. In Colab, I’m using 1.10.
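For example, something as simple as this right after the has_autocast line (just a quick sanity check, nothing course-specific):

```python
import torch
print(torch.__version__)  # e.g. "1.10.x" on Colab vs. "1.9.x" on Kaggle
```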

Hey @Wendy,
The version Kaggle has is 1.9. I am assuming this won’t be the issue then, since autocast is being used in my code as well?

Additionally, if possible, do let me know how to run a notebook on Colab without being active. For instance, when I run a notebook on Colab, I have to keep interacting with it at regular intervals, otherwise the kernel gets disconnected. Is there a way around this, so that I can leave my notebooks running overnight?

Regards,
Elemento

Hi @Elemento,
Unfortunately, I don’t have any tips for running on Colab without being active. I actually thought it was brilliant that you chose Kaggle for exactly that reason, that you can leave it running. :brain:


Oh, I am glad you liked it @Wendy :laughing:. By the way, what about the Kaggle version? Do you think it could be the issue we are looking for?

I doubt the Kaggle version is the issue. Did you ever try just re-running it after a clean start to see if it’s repeatable? I know that will take a while to run, and I’m not sure it’s worth spending much time on this, to be honest, since train_srresnet() doesn’t seem to add much improvement as far as I can tell (maybe because of the first issue you identified). So it should be fine to just cut down on the number of steps for that one, and then you can see how your change affects train_srgan.

Hey @Wendy,
Actually, my GPU quota on Kaggle ran out, so I was waiting for next week. Nonetheless, today I ran it on the CPU and will let you know the outcome once it finishes :innocent:

Hey @Wendy,
I ran the notebook on the CPU with fewer epochs, and both functions ran smoothly as expected. I think this is enough evidence that my changes to the loss functions were correct. The reason the training is collapsing in train_srresnet, and eventually in train_srgan, must be something else, I guess?

Regards,
Elemento

Great! You were brave to try running this on the CPU. It must have been incredibly slow!

Your changes definitely didn’t cause the problem, since the problem happened before the code with your changes was even run.

Cool then, I will wait for the notebook to be updated, and then I will run the notebook once again.

Thanks a lot for your time @Wendy :blush: