I think it’s worth going into a little more detail there. Here’s another thread that discusses this issue.
Note that the situation is fundamentally asymmetric:
When we train the generator, gradients have to flow through the discriminator, since the loss is defined by the discriminator's output, right? So by the chain rule, the generator gradients contain factors computed in the discriminator's backward pass. But then we are careful not to apply those gradients to the discriminator: in that phase we only apply the gradients to the generator's parameters. And we always discard any leftover gradients at the beginning of each training step.
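For concreteness, here's a minimal PyTorch-style sketch of the generator step. The framework, the toy model shapes, and the names (G, D, opt_G) are just assumptions for illustration, not anyone's actual code:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real models; sizes and names are made up.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 128))
D = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
criterion = nn.BCEWithLogitsLoss()

# --- Generator step ---
opt_G.zero_grad()                  # discard any previous gradients
z = torch.randn(8, 16)             # latent batch
pred = D(G(z))                     # the loss is defined via D's output
loss_G = criterion(pred, torch.ones_like(pred))  # G wants D to say "real"
loss_G.backward()                  # backprop flows through D into G;
                                   # D's .grad buffers also get filled here,
                                   # but they are never applied in this step
opt_G.step()                       # only G's parameters are updated
```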
In the case of training the discriminator, the gradients do not involve the generator at all (the fake samples can be treated as fixed inputs), so we literally don't need them. We could compute them and throw them away, so it's not a correctness issue. It's a performance issue: computing gradients means backpropagating through the generator, which costs extra compute and memory (the activations needed for its backward pass have to be kept around), so why do it when you know you don't need the result? Why waste the compute and memory when you're just going to throw those gradients away?
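Continuing the same toy setup as the sketch above, the discriminator step might look like this; detaching the fake batch is one common way to keep the graph from reaching back into the generator at all, so no generator gradients are ever computed:

```python
# --- Discriminator step (same toy G, D, criterion as above) ---
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

opt_D.zero_grad()                      # discard gradients left over from the G step
real = torch.randn(8, 128)             # stand-in for a batch of real data
fake = G(torch.randn(8, 16)).detach()  # detach: the graph does not reach back
                                       # into G, so no generator gradients are
                                       # computed or stored
loss_D = (criterion(D(real), torch.ones(8, 1)) +
          criterion(D(fake), torch.zeros(8, 1)))
loss_D.backward()                      # gradients for D's parameters only
opt_D.step()
```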