Wk1 Programming Assignment

It’s not that you freeze the weights of the “other” model; it’s that you simply don’t update them. Gradients may get computed for the other model, but you never apply them to its weights.
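As a minimal PyTorch-style sketch of that point (the names `gen`, `disc`, and the layer sizes are placeholders, not the assignment’s actual code): each model gets its own optimizer, so calling `step()` on one optimizer can only ever move that model’s weights, no matter what gradients happen to exist elsewhere.

```python
import torch

# Placeholder models; the assignment's real Generator/Discriminator differ.
gen = torch.nn.Sequential(torch.nn.Linear(64, 784), torch.nn.Tanh())
disc = torch.nn.Sequential(torch.nn.Linear(784, 1))

# Separate optimizers: each one only updates the parameters it was given.
gen_opt = torch.optim.Adam(gen.parameters(), lr=2e-4)
disc_opt = torch.optim.Adam(disc.parameters(), lr=2e-4)

# disc_opt.step() can only change disc's weights, so even if gradients were
# accumulated on gen's parameters, this optimizer never applies them.
```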

The situation is fundamentally asymmetric: when you train the discriminator, you don’t depend on the generator’s gradients at all, so you can detach the generator’s output so that those gradients never get created. Creating them wouldn’t be incorrect, since we would discard them anyway; it’s just wasted compute, so detaching saves the effort.
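Here’s a rough sketch of that discriminator step, continuing with the placeholder names above (the loss and variable names are illustrative, not the assignment’s exact code):

```python
criterion = torch.nn.BCEWithLogitsLoss()

def disc_step(real, noise):
    disc_opt.zero_grad()
    # Detach the fake images so no gradients are built for the generator;
    # the discriminator update doesn't need them, so creating them is wasted work.
    fake = gen(noise).detach()
    fake_pred = disc(fake)
    real_pred = disc(real)
    disc_loss = (criterion(fake_pred, torch.zeros_like(fake_pred)) +
                 criterion(real_pred, torch.ones_like(real_pred))) / 2
    disc_loss.backward()
    disc_opt.step()   # only the discriminator's weights move
    return disc_loss
```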

When you are training the generator, on the other hand, you do need gradients to flow through the discriminator, since the generator’s gradients are computed through the discriminator as part of the chain rule. You still only apply the resulting update to the generator’s weights, though.
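And a sketch of the generator step, again with placeholder names: here you do not detach, so gradients flow back through the discriminator to reach the generator, but only `gen_opt.step()` is called, so the discriminator’s weights stay put.

```python
def gen_step(noise):
    gen_opt.zero_grad()
    fake = gen(noise)                 # no detach here
    fake_pred = disc(fake)            # gradients will flow back through disc
    gen_loss = criterion(fake_pred, torch.ones_like(fake_pred))
    gen_loss.backward()               # disc's parameters get .grad populated...
    gen_opt.step()                    # ...but only the generator is updated
    return gen_loss
```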

Here’s another thread that expresses the same ideas with slightly different wording, so it may also be worth a look.