How does BCE loss induce vanishing gradients in GANs?

You raise some really interesting points here. It’s been at least 4 years since I watched the lectures in this course. I assume you are talking about the material in GANs C1 W3, where Prof Zhou introduces Wasserstein Loss as an alternative to BCE. I went back and watched the first three lectures in Week 3 to refresh my memory and think through the points you raise. She explains two different problems with BCE loss: Mode Collapse and then a form of Vanishing Gradients.

Here’s the graph she shows at time offset 3:15 in the third lecture in Week 3:

But my interpretation is that the graph is really talking about the loss for the discriminator, in the case she posits where the discriminator is already too far ahead of the generator: it can give \hat{y} of almost 0 (fake) for the fake images and almost 1 (real) for the real samples. So, yes, the discriminator would not be able to learn much more, but the problem is that it’s already too good, right? The real point is how (or even whether) the generator can learn in that situation. You rightly point out that the generator’s loss function is different: it wants the output of the discriminator to be close to 1 (real) for its fake images.
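To make that scenario concrete, here’s a tiny numerical sketch (mine, not from the lectures) using PyTorch’s nn.BCELoss with some made-up discriminator outputs:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

# Suppose the discriminator is already "too good": it outputs ~0 for fakes and ~1 for reals.
d_on_fake = torch.tensor([0.01, 0.02, 0.01])
d_on_real = torch.tensor([0.99, 0.98, 0.99])

# Discriminator's BCE loss: target 0 for fakes, target 1 for reals.
d_loss = bce(d_on_fake, torch.zeros(3)) + bce(d_on_real, torch.ones(3))
print(f"discriminator loss: {d_loss.item():.4f}")  # tiny -- little left for D to learn

# Generator's BCE loss: it wants D(fake) to be close to 1, so its target is 1.
g_loss = bce(d_on_fake, torch.ones(3))
print(f"generator loss: {g_loss.item():.4f}")  # large -- D confidently rejects the fakes
```

The discriminator’s loss is already nearly zero, while the generator’s loss on those same fake predictions is large, so the real question is whether that large loss translates into useful gradients for the generator.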

Here’s a thread that shows the derivation of \displaystyle \frac {dL}{dz} for BCE loss with a sigmoid output, and here’s how it comes out:

\displaystyle \frac {dL}{dz} = a - y
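Just to spell out the steps behind that result (my own sketch of the standard derivation, with L the BCE loss and a = \sigma(z) the sigmoid output):

\displaystyle L = -\big(y \log a + (1 - y)\log(1 - a)\big), \qquad a = \sigma(z)

\displaystyle \frac {dL}{da} = -\frac {y}{a} + \frac {1 - y}{1 - a}, \qquad \frac {da}{dz} = a\,(1 - a)

\displaystyle \frac {dL}{dz} = \frac {dL}{da} \cdot \frac {da}{dz} = \left(-\frac {y}{a} + \frac {1 - y}{1 - a}\right) a\,(1 - a) = -y\,(1 - a) + (1 - y)\,a = a - y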

Of course that is only for the output layer, and we are dealing with multi-layer networks here, so things are a bit more complicated. When you compute \displaystyle \frac {dL}{dw} even just for the output layer, you’ll pick up a factor of a^{[N-1]}, the activation output of the previous layer. But without going further down that chain of thought (“chain rule” pun intended :nerd_face:), you’re right that a - y will be relatively large in absolute value, close to -1, because a is close to 0 and y is close to 1 from the generator’s point of view. So I agree that the gradients should not end up being vanishingly small in this case, at least at the output layer.
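If you want to sanity-check that conclusion numerically, here’s a quick autograd sketch (again mine, under the same assumptions: the generator’s target is y = 1 and a is close to 0 because the discriminator is confident the image is fake):

```python
import torch
import torch.nn.functional as F

# z is the discriminator's final pre-activation for a fake image;
# sigmoid(-4.6) is about 0.01, i.e. the discriminator confidently says "fake".
z = torch.tensor([-4.6], requires_grad=True)
a = torch.sigmoid(z)
y = torch.ones(1)  # the generator's target: it wants the fake to be called "real"

loss = F.binary_cross_entropy(a, y)
loss.backward()

print(f"a     = {a.item():.4f}")       # ~0.01
print(f"dL/dz = {z.grad.item():.4f}")  # ~ a - y ~ -0.99, not vanishingly small
```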

So I see your point and we’ll need to think more about this. We’re still early in the GANs specialization at this point: there are 3 courses in total and, at least in my (admittedly not very up-to-date) memory, I don’t remember W loss ever being used again in the rest of this course or the other two courses. It’s BCE all the way other than this week. So the impression is that BCE loss does well in a lot of cases, and W loss is just another tool in our toolbox in case we hit a situation in which BCE fails.

The other avenue here is that maybe we get lucky and someone who knows more about GANs than I do will respond. I remember some other mentors who were quite knowledgeable on this, and I will try pinging some of them.

Thanks for the interesting question!