Question about BCE loss

Hi:
In the recorded lecture [Problem with BCE Loss] in Week 3, the lecturer mentioned that the generator wants to maximize the loss while the discriminator wants to minimize it. But during the actual training stage, such as in the Week 1 assignment, we calculate the generator's loss using
gen_loss = criterion(pred_of_fake, torch.ones_like(pred_of_fake))
which means we compare the discriminator's predictions on the generated fake images against the label “1”, since we want the images to look as real as possible.
I understand that in the equation of the BCE loss function, y(i) represents the true label (which in this case should be 0, so that the generator would want the loss to be maximized), but I suppose that when calculating the loss, y(i) actually represents the value we want the predicted result to get close to (which is 1), because that's the meaning of a “loss function”.
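For reference, here is the BCE cost from the lecture as I understand it, where $\hat{y}^{(i)} = h(x^{(i)}, \theta)$ is the discriminator's prediction:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right)\log\left(1 - \hat{y}^{(i)}\right)\right]$$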
In addition, we train the generator and discriminator separately. When training the generator, we only update the parameters inside the generator (based on the loss calculated in the code pasted above). Similarly, when we train the discriminator, we only consider the parameters inside the discriminator.
In both situations, each model wants its own loss to be minimized.
I'm confused about this point. Could anyone point out where my problem is? :slight_smile:

Hi @apricot ,

Welcome back! It’s been a while since you posted in the community :slight_smile:

I think the best way to understand this is to review the training process of the Gen and the Disc.

I will follow a slightly different order from the algorithm of the lab.

First the Disc:

  1. Take the output of the Generator (i.e., the fake images).

  2. Predict the Discriminator’s output based on these images.

  3. Take the real images.

  4. Predict the Discriminator’s output based on these images.

Now we have the Disc predictions for the fake and real images. What we do next is calculate the cost of each. And here's where trick #1 comes in:

  • fake_loss is calculated using Y=0, because the Disc wants to classify fake images as fake (predictions close to 0 give it a low cost).
  • real_loss is calculated using Y=1, because the Disc wants to classify real images as real (predictions close to 1 give it a low cost).

Then the final loss is the average of both, and we update the Disc params (see the sketch below).
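In code, the Disc cycle looks roughly like the following. This is only a sketch: names such as gen, disc, disc_opt, criterion, z_dim, batch_size, real, and device are assumed from the assignment's setup, not copied from it.

    import torch

    disc_opt.zero_grad()

    # Steps 1-2: generate fake images and get the Disc's predictions on them.
    # detach() so this update doesn't backpropagate into the Gen's parameters.
    noise = torch.randn(batch_size, z_dim, device=device)
    fake = gen(noise)
    pred_fake = disc(fake.detach())
    fake_loss = criterion(pred_fake, torch.zeros_like(pred_fake))  # Y = 0

    # Steps 3-4: get the Disc's predictions on the real images.
    pred_real = disc(real)
    real_loss = criterion(pred_real, torch.ones_like(pred_real))   # Y = 1

    # Average the two losses and update only the Disc's parameters.
    disc_loss = (fake_loss + real_loss) / 2
    disc_loss.backward()
    disc_opt.step()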

Now the Gen:

  1. Take the output of the Generator (i.e., the fake images).
  2. Predict the Discriminator’s output based on these images.

Now we calculate the loss on this prediction, and here we have trick #2: while we used Y=0 for the fake images in the Disc cycle, here we calculate the loss against Y=1! Why? Because we want the Disc to believe that these are real images! So:

  • fake_loss is calculated using Y=1, because we want to make the Disc believe that these images are real.

Then we update the Gen params.
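And in code (again a sketch, with the same assumed names plus gen_opt):

    gen_opt.zero_grad()

    # Steps 1-2: generate fake images and get the Disc's predictions on them.
    # No detach() this time: gradients must flow back into the Gen.
    noise = torch.randn(batch_size, z_dim, device=device)
    fake = gen(noise)
    pred_fake = disc(fake)

    # Trick #2: compare against Y = 1, then update only the Gen's parameters.
    gen_loss = criterion(pred_fake, torch.ones_like(pred_fake))
    gen_loss.backward()
    gen_opt.step()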

That’s it! Check this out and let me know if this makes sense :slight_smile:

Juan

Hi @apricot,
You have all the basic concepts right. I think what you overlooked is that in the lecture, the lecturer says the BCE cost is “an average of the cost for the discriminator for misclassifying real and fake observations”, and the generator wants to maximize this cost (the cost for the discriminator).

In our code, we also want to calculate a loss for the generator. As you mentioned, by definition (or convention), a loss function is something we want to minimize. So, for the generator's loss function, we want to choose one where minimizing it means the generator is doing well. This is exactly what

gen_loss = criterion(pred_of_fake, torch.ones_like(pred_of_fake))

does. If the discriminator's prediction for the fake image is close to 1 (i.e., close to guessing it is real), then the generator is doing a good job, and its loss will be low.
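To make this concrete, here is a quick, hypothetical check. It uses nn.BCELoss on already-sigmoided predictions purely for illustration (the assignment works with logits via nn.BCEWithLogitsLoss, as I recall):

    import torch
    from torch import nn

    criterion = nn.BCELoss()  # expects probabilities, not logits

    # Disc is fooled (predictions near 1) -> Gen loss is low.
    pred_of_fake = torch.tensor([0.95, 0.90, 0.99])
    print(criterion(pred_of_fake, torch.ones_like(pred_of_fake)))  # ~0.06

    # Disc spots the fakes (predictions near 0) -> Gen loss is high.
    pred_of_fake = torch.tensor([0.05, 0.10, 0.01])
    print(criterion(pred_of_fake, torch.ones_like(pred_of_fake)))  # ~3.30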

I hope this helps clear things up. As I said, it sounds like you have all the concepts down very well.

Hi Juan:
Thanks a lot for explaining the details of the training process! This confirms that my understanding of the training process is correct. However, I'm still confused about one thing:


Why did she say the generator wants to maximize the cost…?
According to the training process of the generator, we use Y=1 to make the Disc believe that the generated images are real. When Y=1 and the predicted result → 1, the loss → 0.
Thus the generator also wants to minimize the cost, I suppose…
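In other words, plugging y(i) = 1 into the BCE formula reduces it to

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\log \hat{y}^{(i)},$$

which goes to 0 as the predictions approach 1.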

And because of this confusion, I have trouble understanding the formula in the following lecture :open_mouth: which involves the corresponding min-max ideas.
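(For reference, I believe the formula I mean is the standard GAN minimax objective,

$$\min_{g}\,\max_{d}\;\mathbb{E}\big[\log d(x)\big] + \mathbb{E}\big[\log\big(1 - d(g(z))\big)\big],$$

where the disc maximizes the expression and the gen minimizes it; equivalently, the gen maximizes the BCE cost, which is the negative of this expression.)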

Check out @Wendy's answer. I think you might find your answer there.

One comment on this:

The fake starts out very far from a real one, right? In fact, in the case of an image, it may start as an image full of noise.

What happens when you compute the cost of this noisy fake image with Y=1? The Disc's prediction will be very far from 1, so the loss will actually be high.

In the next iteration the image will be a bit better, so computing the cost with Y=1 again will still produce a high loss, but hopefully less than in the previous iteration.

And so on and so on.

After a few thousand iterations, the images have improved so much that we start seeing a loss closer to 0.
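You can see this numerically. Here is a hypothetical illustration, with made-up predictions mimicking a Disc that is gradually being fooled:

    import math

    # Disc's probability that a fake image is real, at a few points in training.
    for step, pred in [(0, 0.02), (1000, 0.20), (10000, 0.60), (50000, 0.95)]:
        loss = -math.log(pred)  # BCE with target Y = 1 reduces to -log(pred)
        print(f"step {step:>6}: pred = {pred:.2f} -> gen_loss = {loss:.2f}")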

That’s how I understand Sharon’s statement of “Generator >> Maximize Cost”.

What do you think?

Hi Wendy:
Thanks for reminding me about the content I overlooked!
I’ve watched the lecture again and it seems pretty clear now :slight_smile:
Basically, the lecture presents the formula from the discriminator's point of view (from the disc's standpoint, the gen wants to maximize the disc's loss), and the “maximizing” is mentioned to provide a better understanding of the W-loss formula; it doesn't describe what actually happens in the real gen training stage.


Thanks for your input! I've understood where I went wrong :grinning:
And just as I replied to Wendy, the “maximize” mentioned in that slide doesn't mean that's what we do in the training stage; it is just used to help us understand the concepts behind both BCE loss and W-loss.
