Hello Everyone, this is my first question in the community.
So, let's say these are the backprop equations:
\displaystyle \frac{dL}{dw} = \frac{dz}{dw} \cdot \frac{dA}{dz} \cdot \frac{dL}{dA} ---- eq (1)
z = the linear layer output
A = the activation output
\displaystyle L = -y\log(\hat{y}) - (1 - y)\log(1 - \hat{y})
As the generator loss only looks at the first part (the generator's target label is y = 1):
\displaystyle L = -y\log(\hat{y})
\displaystyle \frac{dL}{dA} = -\frac{y}{\hat{y}} = -\frac{1}{\hat{y}}
If \hat{y} = 0.01, then \frac{dL}{dA} = -100. If we plug this into eq (1), \frac{dL}{dA} is a really large factor, not a small one. How can this cause vanishing gradients?
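As a quick sanity check (my own snippet, not course code), PyTorch autograd agrees with that -100:

```python
# Quick check (my own, not from the course notebooks) that dL/dA = -1/y_hat
# for the generator half of BCE, L = -log(y_hat), using autograd.
import torch

y_hat = torch.tensor(0.01, requires_grad=True)   # A = discriminator's output for a fake image
loss = -torch.log(y_hat)                         # generator part of BCE (target y = 1)
loss.backward()
print(y_hat.grad)                                # tensor(-100.) = -1 / 0.01
```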
According to the instructor, as the discriminator learns faster than the generator, the loss surface becomes flatter (the discriminator outputs values near 0 for the fake images), which is not at all helpful for the generator to improve: because the loss is flat, the gradient updates are often negligible. I am seeking a proof of this explanation.
You raise some really interesting points here. It’s been at least 4 years since I watched the lectures in this course. I assume you are talking about the material in GANs C1 W3 where Prof Zhou introduces us to Wasserstein Loss as an alternative to BCE. I went back and watched the first three lectures in Week 3 to refresh my memory and try to understand the implications for the points you raise. She explains two different types of problems with BCE loss: Mode Collapse and then some variation of Vanishing Gradients.
Here’s the graph she shows at time offset 3:15 in the third lecture in Week 3:
But my interpretation is that it is really talking about the loss for the discriminator, and it applies to the case she posits, where the discriminator is already too far ahead of the generator and can give almost 0 (fake) as \hat{y} for the fake images and almost 1 (real) for the sample real images. So, yes, it would be the case that the discriminator would not be able to learn much more, but the problem is that it's already too good, right? The real point is how (or even whether) the generator can learn in that situation. You rightly point out that the generator's loss function is different: it wants the output of the discriminator to be close to 1 (real) for its fake images.
Here’s a thread which shows the derivation of \displaystyle \frac {dL}{dz} and here’s how it comes out:
\displaystyle \frac {dL}{dz} = a - y
Of course that is only for the output layer and we are dealing with multi-layer networks here, so things are a bit more complicated. When you compute \displaystyle \frac {dL}{dw} even just for the output layer, you’ll get a factor of a^{[N-1]} for the activation output of the previous layer. But without going further in that chain of thought (“chain rule” pun intended), you’re right that a - y will be a relatively big number (in absolute value), meaning close to -1 (because a is close to 0 and y is close to 1). So I agree that the gradients should not end up being vanishingly small in this case.
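Here's a quick numerical check of that formula (just my own snippet, assuming a sigmoid output unit and the full BCE loss):

```python
# My own sanity check (not course code): for sigmoid + BCE, dL/dz = a - y.
import torch

z = torch.tensor(-4.6, requires_grad=True)   # a logit; sigmoid(-4.6) is about 0.01
y = torch.tensor(1.0)                        # the generator's target ("real")

a = torch.sigmoid(z)
loss = -(y * torch.log(a) + (1 - y) * torch.log(1 - a))
loss.backward()

print(z.grad.item())       # about -0.99
print((a - y).item())      # a - y, matches dL/dz
```

In other words, the huge -\frac{1}{\hat{y}} factor from the loss gets multiplied by the tiny sigmoid slope \sigma'(z) = a(1 - a), and the product comes out to the bounded a - y.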
So I see your point and we’ll need to think more about this. We’re still early in the GANs specialization at this point: we’ve got a total of 3 courses ahead of us and at least in my (admittedly not very up-to-date) memory, I don’t remember W loss ever being used again in the rest of this course or the other two courses. It’s BCE all the way other than this week. So the impression is that BCE loss does well in a lot of cases and W loss is just another tool in our toolbox in case we hit a situation in which BCE fails.
The other avenue here is that maybe we get lucky and we can find someone else to respond who knows more about GANs than I do. I remember some other mentors who were quite knowledgeable on this and will try pinging some of them.
Based on the logs in the loss function, I think you’re asking about BCE loss, which is reviewed in the early videos of week 3.
Your reasoning seems sound, except I think you are looking at the wrong part of the equation for the case of generated images. y is the label that tells us whether the image is actually real or fake. Since fake = 0, when the image is fake the first part of the equation becomes 0 * log(y_hat) = 0. It's the second half of the equation you want to be looking at for generated images, since (1 - 0) = 1. Also remember that the activation function we typically use with BCE loss is the sigmoid function, for true/false classification like this.
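For a concrete illustration (my own toy numbers, not from the assignment), here's BCE evaluated with the same prediction under both labels:

```python
# Toy illustration (mine, not assignment code): with y = 0 only the
# (1 - y)*log(1 - y_hat) term is active; with y = 1 only -y*log(y_hat) is.
import torch
import torch.nn.functional as F

y_hat = torch.tensor(0.01)   # discriminator is confident the image is fake

print(F.binary_cross_entropy(y_hat, torch.tensor(0.0)).item())  # ~0.01 = -log(1 - 0.01)
print(F.binary_cross_entropy(y_hat, torch.tensor(1.0)).item())  # ~4.6  = -log(0.01)
```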
Hope this helps. Let me know if I misunderstood something.
We want the generator to produce images that look as real as possible. Our loss will be "how far away are the generated images from being classified as real by the discriminator?"
For example, if the discriminator outputs 0.43 (43% "real"), then -y\log(\hat{y}) with y = 1 equates to -\log(0.43) \approx 0.84.
In the code, we can observe that we are passing an array of ones as the labels, to check how far away the fakes are from fooling the discriminator (see the sketch below).
The second part of the BCE loss, (1 - y)\log(1 - \hat{y}), will not be used; we only want to compare our generated images to see how far they are from the distribution of real ones.
Which resulted in:
\displaystyle L = -y\log(\hat{y})
\displaystyle \frac{dL}{dA} = -\frac{1}{\hat{y}}
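A rough sketch of that generator-loss step, with made-up discriminator outputs (this is my own illustration, not the actual notebook code):

```python
# Rough sketch (not the actual notebook code): compare discriminator
# predictions on fake images against an array of ones, so only the
# -y*log(y_hat) term of BCE is active.
import torch
import torch.nn.functional as F

disc_pred_fake = torch.tensor([0.43, 0.01, 0.90])   # made-up discriminator outputs (probabilities)
labels = torch.ones_like(disc_pred_fake)            # pretend every fake image is "real"
gen_loss = F.binary_cross_entropy(disc_pred_fake, labels)
print(gen_loss.item())   # mean of -log(0.43), -log(0.01), -log(0.90), about 1.85
```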
Corrections in my explanation are always welcome, I have just started exploring this area.
I focused on the wrong part. Of course you're right that when we are calculating the generator loss, we're targeting the generator's goal of getting a prediction of 1 for its generated images. So I now see why you were using that part of the BCE function, and the result makes sense: if the generator wants a 1 but the discriminator is good, it will predict something close to 0 (fake), which gives a large loss.
The important part is that the activation function we use for the output is the sigmoid function, with its very flat slopes at both ends of its curve. Its derivative approaches 0 at either end. So, when we apply the chain rule, we're getting into vanishing-gradient territory as the sigmoid saturates, even as the loss itself gets very large.
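To make the flatness concrete (just an illustration, not course material), here's how quickly the sigmoid's slope collapses:

```python
# Illustration (mine, not course code): sigma'(z) = sigma(z) * (1 - sigma(z))
# shrinks toward 0 as |z| grows, which is the saturation described above.
import torch

for z in [0.0, 2.0, 5.0, 10.0]:
    s = torch.sigmoid(torch.tensor(z))
    print(z, (s * (1 - s)).item())
# 0.0  -> 0.25
# 2.0  -> ~0.105
# 5.0  -> ~0.0066
# 10.0 -> ~4.5e-05
```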
For Binary Cross-Entropy (BCE) loss to function properly, we need to squash the output into the range 0 to 1, which requires using a sigmoid activation. However, the sigmoid has a very flat slope at the ends of its curve, leading to the vanishing gradient problem. In contrast, the Earth Mover's Distance (EMD) or Wasserstein loss doesn't have this requirement, helping to mitigate the vanishing gradient issue.