Why should we detach the discriminator's input?

I would like to add more details to the answer above.

When we train the Discriminator, we don't need to track operations for the Generator, since we are not going to update the Generator or use its gradients in that step. Detaching the Generator's output from the computational graph therefore saves memory and speeds up training.
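As a minimal sketch of that discriminator step (the names `generator`, `discriminator`, `criterion`, `d_optimizer`, `real_images` and `noise` are hypothetical placeholders, not from the original post):

```python
import torch

d_optimizer.zero_grad()

# Real batch: the discriminator should output "real" (1).
real_pred = discriminator(real_images)
real_loss = criterion(real_pred, torch.ones_like(real_pred))

# Fake batch: detach() cuts the graph at the generator's output,
# so no gradients are computed or stored for the generator here.
fake_images = generator(noise)
fake_pred = discriminator(fake_images.detach())
fake_loss = criterion(fake_pred, torch.zeros_like(fake_pred))

(real_loss + fake_loss).backward()
d_optimizer.step()  # only the discriminator's parameters are updated
```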

When we train the Generator, we need the gradients to flow back through the Discriminator, but we won't update the Discriminator itself. Note that parameters are only updated when optimizer.step() is called, and each model has its own optimizer; backward() only computes gradients and does not change any parameters.

Here is an illustration of the generator training step. (Note that we pass the real label to the discriminator so that the gradients push the generator towards producing data that looks real.)
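A minimal sketch of that generator step, reusing the same hypothetical names as above; notice there is no detach() here, and only the generator's optimizer is stepped:

```python
import torch

g_optimizer.zero_grad()

# No detach: gradients must flow through the discriminator
# back into the generator's parameters.
fake_images = generator(noise)
fake_pred = discriminator(fake_images)

# Real (true) labels: the loss is small when the discriminator
# classifies the generated images as real.
g_loss = criterion(fake_pred, torch.ones_like(fake_pred))

g_loss.backward()   # computes gradients for both models
g_optimizer.step()  # but only the generator's parameters are updated
```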

You can read this thread to gain more intuition about this.
