Detach() used in Assignment 4

The other point worth making here is that this is a performance issue, not a correctness issue. The training code is careful to start each training step by zeroing the gradients, so that we don't accidentally accumulate previously computed gradients, and to apply gradients only to the model that is actually being trained in that step. Of course, we alternate between training the generator and training the discriminator. When we train the discriminator, we don't need the gradients of the generator, and computing them is expensive, so detaching saves those compute cycles.

But also note that the situation is fundamentally asymmetric: when we train the generator, we do need the gradients of the discriminator, because the generator's cost is by definition computed from the output of the discriminator. In that case we can't do the "detach", and instead we depend on the logic I referred to earlier, which is careful to apply only the gradients of the model we are actually training (the generator in that case). A sketch of both steps is below.
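To make that concrete, here is a minimal sketch of one combined training iteration in PyTorch. The names (`gen`, `disc`, `gen_opt`, `disc_opt`, `real`, `z_dim`) and the BCE-with-logits losses are just illustrative assumptions about a typical GAN setup, not the actual assignment code; the point is where `detach()` appears and where it doesn't.

```python
import torch
import torch.nn.functional as F

def train_step(gen, disc, gen_opt, disc_opt, real, z_dim, device):
    """One discriminator step followed by one generator step (illustrative sketch)."""
    batch_size = real.size(0)

    # --- Discriminator step ---
    disc_opt.zero_grad()                          # start from zero gradients
    noise = torch.randn(batch_size, z_dim, device=device)
    fake = gen(noise)
    # detach(): we don't need gradients w.r.t. the generator here,
    # so we cut the graph and skip that part of the backward pass.
    disc_fake_pred = disc(fake.detach())
    disc_real_pred = disc(real)
    disc_loss = (
        F.binary_cross_entropy_with_logits(disc_fake_pred, torch.zeros_like(disc_fake_pred))
        + F.binary_cross_entropy_with_logits(disc_real_pred, torch.ones_like(disc_real_pred))
    ) / 2
    disc_loss.backward()
    disc_opt.step()                               # only discriminator parameters are updated

    # --- Generator step ---
    gen_opt.zero_grad()
    noise = torch.randn(batch_size, z_dim, device=device)
    fake = gen(noise)
    # No detach here: the generator's loss is computed through the discriminator,
    # so gradients must flow back through it. The discriminator parameters still
    # receive gradients, but gen_opt.step() only updates the generator.
    disc_fake_pred = disc(fake)
    gen_loss = F.binary_cross_entropy_with_logits(disc_fake_pred, torch.ones_like(disc_fake_pred))
    gen_loss.backward()
    gen_opt.step()                                # only generator parameters are updated

    return disc_loss.item(), gen_loss.item()
```

Note how correctness comes from `zero_grad()` at the start of each step plus the fact that each optimizer only holds one model's parameters; `detach()` in the discriminator step merely avoids a backward pass through the generator that would be thrown away anyway.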
