Hi,
in the Week 1 “putting it all together” lesson, the lecturer says that if the discriminator is too good, then the model can’t improve.

Based on what we’ve learned so far, when training the generator you freeze the discriminator and set the labels to 1 (real). If the discriminator is very good, then the loss would be high, right? (loss = -[y·log(h) + ...] is high when y is 1 and h is really small.) If the loss is high, then the gradients should be strong, so why does this mean the model can’t improve?
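To make what I mean concrete, here’s a quick check in plain Python (the 0.001 prediction is just a number I made up for a “very good” discriminator scoring a fake image):

```python
import math

# Binary cross-entropy for a single example: -(y*log(h) + (1-y)*log(1-h)).
# With the label forced to y = 1, this reduces to -log(h).
def bce_loss(y, h):
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))

# A confident discriminator gives a fake image a tiny score, e.g. h = 0.001.
print(bce_loss(1, 0.001))  # ~6.91, i.e. the loss is indeed high
```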

I think it means that if the discriminator is too good, the generator can’t fool it, so it’s not of much use. There is a training phase for the discriminator too, so it’s not static, and if it keeps improving and the generator can’t catch up with it, then it’s not of much use.

Hi Paul, thanks for your reply! Sorry, I’m still a bit confused; my understanding of GANs is limited to Week 1, Assignment 1.

But the discriminator cost is at a minimum: the values are all 0, which is as low as they can go

This is true when you train the discriminator: if the prediction is low, e.g. 0.001, and the label is 0, the loss is near its minimum. But we detached the generator output, so no gradient will be backpropagated to the generator anyway.

However, when updating the generator, we pass the generator output to the discriminator and manually set the label to 1. If the prediction is 0.001, then the loss should be -log(0.001), which is large, so there should be “plenty” of gradient left when it is backpropagated to the generator.
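A quick sanity check of that intuition in plain Python (again using 0.001 as a hypothetical prediction): the analytic derivative of -log(h) with respect to h is -1/h, so a tiny prediction gives a large gradient at the discriminator’s output, and a central finite difference agrees:

```python
import math

def gen_loss(h):
    # generator loss with the label set to 1: -log(h)
    return -math.log(h)

h = 0.001
analytic = -1.0 / h  # d/dh of -log(h)

# Central finite-difference estimate of the same derivative.
eps = 1e-9
numeric = (gen_loss(h + eps) - gen_loss(h - eps)) / (2 * eps)

print(analytic, numeric)  # both close to -1000: far from a vanishing gradient
```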

Good question, @Yu_Hou! In general, from an implementation perspective, we can run into trouble both when the gradient is very small (vanishing gradients) and when the gradient is very large (exploding gradients).

In our example, if the discriminator were perfect and returned 0 for all fake images, then log(0) is undefined, which is obviously a problem. But even if the discriminator returns a very small prediction and the log gets extremely large, we can end up with unpredictable results. In theory this might work, but in practice we run into issues due to rounding errors and limited floating-point precision when dealing with large numbers, as well as amplified effects of the randomness of the images our generator has generated. Imagine the discriminator assigning slightly different, very small predictions to each of the images in a batch: this would lead to a large range of log() values, so backprops calculated from the average for each batch may not consistently (or quickly) get us to a good solution.
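To illustrate (the prediction values here are hypothetical, just to show the scale of the effect): tiny differences between very small predictions turn into a huge spread of loss values, and an exact 0 breaks the log outright.

```python
import math

# Hypothetical "very small" predictions a near-perfect discriminator
# might assign to different fake images in one batch.
preds = [1e-7, 1e-12, 1e-20]

# Generator loss -log(D(G(z))) for each image.
losses = [-math.log(p) for p in preds]
print(losses)  # roughly [16.1, 27.6, 46.1]: a wide spread from tiny input differences

# And a perfect discriminator output of exactly 0 is undefined under the log.
try:
    math.log(0.0)
except ValueError as e:
    print("log(0) fails:", e)
```

Averaging such wildly different per-image losses is what makes the batch gradient noisy and the updates erratic.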

If you want to get deep into this, you can take a look at this paper: https://arxiv.org/pdf/1701.04862.pdf, in particular Section 2.2.2, “The -log D Alternative”.