When maximizing the critic score, wouldn't that tend to produce a larger violation (unless lambda is negative)?
Actually, the assignment hints that a higher gradient penalty means a higher critic loss. I assume it means:
crit_loss = -crit_score = torch.mean(crit_fake_pred) - torch.mean(crit_real_pred) + c_lambda * gp
If this is true, then the sign of the gradient penalty term in the lecture should be negative, right?
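To make the sign convention I mean concrete, here's a minimal runnable sketch (variable names follow the assignment; the prediction tensors and gp value are toy stand-ins I made up):

```python
import torch

# Toy stand-ins for the critic's predictions and the penalty value:
crit_real_pred = torch.randn(64, 1)
crit_fake_pred = torch.randn(64, 1)
gp = torch.tensor(0.5)
c_lambda = 10.0

# Critic *score* as written in the lecture (to be maximized):
crit_score = torch.mean(crit_real_pred) - torch.mean(crit_fake_pred) - c_lambda * gp

# Critic *loss* as in the assignment (to be minimized) -- here the
# penalty enters with a positive sign, which is why the lecture's
# formula would need a negative lambda to match:
crit_loss = torch.mean(crit_fake_pred) - torch.mean(crit_real_pred) + c_lambda * gp

assert torch.isclose(crit_loss, -crit_score)
```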
Also, in the assignment, the generator loss is
gen_loss = -torch.mean(crit_fake_pred)
which doesn't rely on the gradient penalty. I think this is because, when training the generator, all of the critic's parameters are fixed, so the gradient penalty on the critic's parameters is fixed at 0, right?
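For reference, here is a minimal sketch of a generator update under this reading (the gen and crit modules are hypothetical stand-ins, not the assignment's exact scaffold):

```python
import torch

# Hypothetical generator/critic modules standing in for the assignment's:
gen = torch.nn.Linear(64, 784)
crit = torch.nn.Linear(784, 1)
gen_opt = torch.optim.Adam(gen.parameters(), lr=2e-4)

noise = torch.randn(128, 64)
fake = gen(noise)
crit_fake_pred = crit(fake)

# Generator loss: only the critic's score on fakes, no gradient penalty.
gen_loss = -torch.mean(crit_fake_pred)

gen_opt.zero_grad()
gen_loss.backward()
gen_opt.step()  # only gen's parameters are updated; crit is untouched
```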
@Jack_Changfan, you are exactly right - lambda would need to be negative in the equation shown in the lecture.
The thinking at the time they put the course together was that the top priority for the lectures was to get the concepts across and leave the implementation details to the assignments. In this case, they wanted to present the basic concept that "With the gradient penalty, all you need to do is add a regularization term to your loss function." Since lambda is a variable, it could theoretically be negative, and spending lecture time on the sign of the gradient penalty term might have muddled the concept.
But shouldn't it rely on the gradient penalty term? The critic is passed x_hat, which is a linear interpolation of fake and real samples, and while training the generator the fake samples vary, so the critic's x_hat varies as well.
I'm not sure if my understanding is right. It would be great if someone could help me understand why the gradient penalty term should not be part of the generator loss.
@Akanksha_Paul, remember from the videos that the 1-L continuity condition that we’re trying to address with the gradient penalty is a condition on the critic only.
As for why this is the case, for me the easiest way to think about it is to remember that the critic needs to consider both its predictions for fake images and its predictions for real images, with the goal of pushing these two distributions farther apart from each other. This is exactly the situation where we need the extra condition to encourage 1-L continuity.
The generator, on the other hand, really only needs to consider the critic’s predictions on its fake images. Its goal is to fool the critic with its fake images - the higher the value, the more real the critic thinks the generator’s image is, which is exactly what the generator wants.
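If it helps to see it in code, here is a minimal sketch of the usual WGAN-GP penalty computation (my own illustration, not the assignment's exact code). Note that the gradient is taken with respect to the interpolated images x_hat, not the critic's parameters, and the resulting penalty is added only to the critic's loss:

```python
import torch

crit = torch.nn.Linear(784, 1)  # stand-in critic

real = torch.randn(128, 784)
fake = torch.randn(128, 784)

# x_hat: random linear interpolation between real and fake images
epsilon = torch.rand(128, 1)
x_hat = epsilon * real + (1 - epsilon) * fake
x_hat.requires_grad_(True)

mixed_scores = crit(x_hat)

# Gradient of the critic's scores w.r.t. the *images*, not the parameters
gradient = torch.autograd.grad(
    outputs=mixed_scores,
    inputs=x_hat,
    grad_outputs=torch.ones_like(mixed_scores),
    create_graph=True,
)[0]

# Penalize deviation of each image's gradient norm from 1 (the 1-L condition)
gp = torch.mean((gradient.norm(2, dim=1) - 1) ** 2)
```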
If you want to go deeper into exactly why this works the way it does, there’s a link to the official Wasserstein GAN paper here: Build Basic GANS Week 3 Works Cited
I’m confused about the computation of the Wasserstein critic’s loss with regard to the prediction of Real examples.
With BCE, we measure how far the prediction of Real is from 1, which is the label for Real.
But for Wasserstein, there is no upper limit for the prediction and no label to measure against. If there is no upper limit, how can we measure how "wrong" the prediction of a Real example is? Is a prediction of 10 "worse" than a prediction of 20 (for a Real)?
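To make my question concrete, here is a small sketch of the two losses as I understand them (my own code, not from the assignment):

```python
import torch

crit_real_pred = torch.tensor([10.0, 20.0])

# BCE: each prediction is scored against the label 1 for Real
bce = torch.nn.functional.binary_cross_entropy_with_logits(
    crit_real_pred, torch.ones_like(crit_real_pred)
)

# Wasserstein: no label and no upper limit -- the real predictions only
# enter the loss through their mean, relative to the fake predictions
crit_fake_pred = torch.tensor([1.0, 2.0])
crit_loss = torch.mean(crit_fake_pred) - torch.mean(crit_real_pred)
```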
I’d like to point out that how to calculate c(x) was not explained in the videos or in the slides, hence my question.