This question refers to functions from the Course 1, Week 1 programming assignment.
When calculating disc_loss in the get_disc_loss() function, we detach the output of the generator to ensure that only the discriminator is updated.
However, when calculating gen_loss in get_gen_loss(), we do not freeze the weights of the discriminator. This means that in this part, both the generator's and the discriminator's weights are updated.
Is there a reason why this is so? Wouldn't it lead to the discriminator learning faster, since it is updated in both cases, whereas the generator is updated in only one? This seems especially problematic since training the generator is considered the harder task.
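For context, this is roughly what I mean (a sketch from memory, not the exact assignment code; the signatures and names like criterion and z_dim are as I recall them):

```python
import torch

def get_disc_loss(gen, disc, criterion, real, num_images, z_dim, device):
    noise = torch.randn(num_images, z_dim, device=device)
    fake = gen(noise).detach()             # detach: no gradients flow into the generator
    fake_pred = disc(fake)
    fake_loss = criterion(fake_pred, torch.zeros_like(fake_pred))
    real_pred = disc(real)
    real_loss = criterion(real_pred, torch.ones_like(real_pred))
    return (fake_loss + real_loss) / 2

def get_gen_loss(gen, disc, criterion, num_images, z_dim, device):
    noise = torch.randn(num_images, z_dim, device=device)
    fake = gen(noise)                      # no detach: gradients must reach the generator
    fake_pred = disc(fake)
    # the labels here are "real", which is what my second question below is about
    return criterion(fake_pred, torch.ones_like(fake_pred))
```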
Another point to add to the above question:
In the second case (i.e., in get_gen_loss()), wouldn't it also be wrong to train the discriminator, since we are providing the wrong values for the true labels (the fake images are labeled as real)?
Sorry for the trouble. I figured out where my confusion was: after calling gen_loss.backward(), the update step is gen_opt.step(), and since gen_opt only contains the parameters of the generator, only the generator is updated during this step.
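To make that concrete, here is a minimal sketch (with made-up layer sizes, not the assignment's architecture) of why gen_opt.step() can only ever touch the generator:

```python
import torch
from torch import nn

gen = nn.Linear(10, 784)     # stand-in generator
disc = nn.Linear(784, 1)     # stand-in discriminator
criterion = nn.BCEWithLogitsLoss()

# each optimizer is constructed with only one model's parameters
gen_opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)

gen_opt.zero_grad()
fake = gen(torch.randn(5, 10))
gen_loss = criterion(disc(fake), torch.ones(5, 1))
gen_loss.backward()          # fills .grad for BOTH models (chain rule)
gen_opt.step()               # but only the generator's weights change
```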
Yes, that’s the point: we only update the model that is being trained. The code is always careful to discard any previous gradients (with zero_grad()) when it starts each new training cycle.
Note that the situation is fundamentally asymmetric: when we train the discriminator, we literally don’t need the gradients of the generator, so we save work by not computing them. But when we train the generator, the gradients are of the cost, which is computed from the output of the discriminator, right? So we have no choice but to compute gradients through the discriminator (remember how the Chain Rule works). But we are careful not to apply the discriminator gradients in that case.
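If it helps, here is a small self-contained sketch (toy layers, not the assignment code) that you can run to check both halves of that claim:

```python
import torch
from torch import nn

gen, disc = nn.Linear(10, 784), nn.Linear(784, 1)
gen_opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
disc_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)

disc_before = disc.weight.clone()
gen_loss = disc(gen(torch.randn(5, 10))).mean()
gen_loss.backward()

print(disc.weight.grad is not None)           # True: disc gradients were computed
gen_opt.step()
print(torch.equal(disc.weight, disc_before))  # True: but they were never applied
disc_opt.zero_grad()                          # and they are discarded before disc's turn
```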
The other point worth making is that the detach() is only a performance issue, not a correctness issue, for the reason that you pointed out: the code only applies the gradients for the module that it is training in any case. But computing gradients is costly from a CPU and memory standpoint, so it’s better not to do it in cases where you don’t actually need them.
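A tiny sketch of that efficiency point (again with toy layers, not the assignment code): with detach() the backward pass for the discriminator loss stops at the fake images, so no generator gradients are computed; without it they would be computed and then simply ignored.

```python
import torch
from torch import nn

gen, disc = nn.Linear(10, 784), nn.Linear(784, 1)
fake = gen(torch.randn(5, 10)).detach()   # cut the graph here
disc_loss = disc(fake).mean()
disc_loss.backward()
print(gen.weight.grad)                    # None: no work wasted on the generator
```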