Alternately training the Generator and the Discriminator

This question refers to functions from Course-1–Week-1’s programming assignment.

When calculating disc_loss in the get_disc_loss() function, we detach the output of the generator to ensure that only the discriminator is updated.

However, when calculating gen_loss in get_gen_loss(), we do not freeze the weights of the discriminator. This means that in this part, both the generator's and the discriminator's weights are updated.
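For concreteness, here is a minimal sketch of the two loss computations being discussed. The function names come from the assignment, but the tiny linear `gen`/`disc` and the BCE setup are simplified stand-ins, not the assignment's actual code:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Tiny stand-ins for the assignment's generator and discriminator.
gen = nn.Linear(4, 8)    # maps noise -> "image"
disc = nn.Linear(8, 1)   # maps "image" -> real/fake logit
criterion = nn.BCEWithLogitsLoss()

def get_disc_loss(real, noise):
    # Detach the generator's output: we don't need generator
    # gradients when training the discriminator.
    fake = gen(noise).detach()
    fake_loss = criterion(disc(fake), torch.zeros(fake.size(0), 1))
    real_loss = criterion(disc(real), torch.ones(real.size(0), 1))
    return (fake_loss + real_loss) / 2

def get_gen_loss(noise):
    # No detach here: the loss flows through disc, so disc's
    # gradients get computed too -- the question below is about
    # whether that means disc is also trained.
    fake = gen(noise)
    return criterion(disc(fake), torch.ones(fake.size(0), 1))

noise = torch.randn(16, 4)
real = torch.randn(16, 8)
print(float(get_disc_loss(real, noise)), float(get_gen_loss(noise)))
```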

Is there any reason why this is so? Wouldn't this lead to the discriminator learning faster (since it is updated in both cases, whereas the generator is updated in only one)? Especially since learning the generator is considered harder.

Another point to add to the above question:
In the second case (i.e. get_gen_loss()), wouldn't it also be wrong to train the discriminator, since we would be providing it with the wrong values for the true labels?

Sorry for the trouble. I understood where the problem is. After calling gen_loss.backward(), the update step is gen_opt.step(), and since gen_opt only contains the parameters of the generator, only the generator is updated during this step.
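That resolution is easy to verify in a few lines. This sketch uses hypothetical tiny linear modules in place of the assignment's models; the point is only that the optimizer's parameter list scopes the update:

```python
import torch
from torch import nn

torch.manual_seed(0)
gen = nn.Linear(4, 8)
disc = nn.Linear(8, 1)
# The optimizer holds ONLY the generator's parameters.
gen_opt = torch.optim.SGD(gen.parameters(), lr=0.1)

disc_before = disc.weight.detach().clone()
gen_before = gen.weight.detach().clone()

# Generator loss flows through the discriminator (no detach).
loss = disc(gen(torch.randn(16, 4))).mean()
loss.backward()
gen_opt.step()  # applies gradients to gen only

print(torch.equal(disc.weight, disc_before))  # True: disc untouched
print(torch.equal(gen.weight, gen_before))    # False: gen updated
```

Note that disc.weight.grad is populated after backward() (the gradients were computed), yet step() never applies them, because disc's parameters were never handed to gen_opt.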

Yes, that’s the point: we only update the one that is being trained. The code is also always careful to discard any previous gradients when it starts each new training cycle.

Note that the situation is fundamentally asymmetric: when we train the discriminator, we literally don’t need the gradients of the generator, so we save work by not computing them. But in the case of training the generator, the gradients are w.r.t. the cost, which is the output of the discriminator, right? So we have no choice but to compute the gradients of the discriminator (remember how the Chain Rule works). But we are careful not to apply the discriminator gradients in that case.
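That asymmetry can be checked directly. Again with hypothetical stand-in modules, this shows that detaching during discriminator training means no generator gradients are ever computed, while generator training necessarily backprops through the discriminator:

```python
import torch
from torch import nn

torch.manual_seed(0)
gen = nn.Linear(4, 8)
disc = nn.Linear(8, 1)
z = torch.randn(16, 4)

# Discriminator training: fake images are detached, so autograd
# never computes generator gradients -- saved work.
disc(gen(z).detach()).mean().backward()
print(gen.weight.grad)                # None: no gen gradients computed
print(disc.weight.grad is not None)   # True

# Generator training: the cost is the discriminator's output, so by
# the Chain Rule we must backprop *through* the discriminator.
gen.zero_grad()
disc.zero_grad()
disc(gen(z)).mean().backward()
print(gen.weight.grad is not None)    # True
print(disc.weight.grad is not None)   # True: computed, but never applied
```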


The other point worth making is that the “detach” is only a performance issue, not a correctness issue, for the reason that you pointed out: the code only applies the gradients for the module that it is training in any case. But computing gradients is costly in both compute and memory, so it’s better not to do it in cases in which you don’t actually need them.
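You can confirm that the detach changes only the work done, not the result: training the discriminator with or without the detach produces the identical update. Sketch with hypothetical stand-in modules, using plain SGD so the two runs are deterministic and comparable:

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)
gen = nn.Linear(4, 8)
disc_a = nn.Linear(8, 1)
disc_b = copy.deepcopy(disc_a)  # identical twin discriminator
z = torch.randn(16, 4)

def disc_step(disc, fake):
    opt = torch.optim.SGD(disc.parameters(), lr=0.1)
    opt.zero_grad()
    disc(fake).mean().backward()
    opt.step()

disc_step(disc_a, gen(z).detach())  # with detach (cheaper)
disc_step(disc_b, gen(z))           # without detach (gen grads wasted)

# Same discriminator update either way: detach affects cost, not correctness.
print(torch.allclose(disc_a.weight, disc_b.weight))  # True
```

In the second run the generator's gradients are computed (wasted effort) but never applied, since the optimizer only holds the discriminator's parameters.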
