In the week 1 programming assignment, I only called detach() in get_disc_loss(), which is consistent with the hints in the lab.
My question is whether we also need to freeze the discriminator's weights in get_gen_loss(). The course video clearly says that when you train the generator, backpropagation should only update the generator's weights, and the discriminator's weights should be frozen.
I didn't freeze the weights and still got a 100% score. Is this expected?
If I really did want to freeze the weights in this case, how should I do that? My understanding is that calling detach() on the discriminator's output won't work, because it detaches that node from the compute graph, so its parent nodes, including the generator, would be cut off as well.
It’s not that you freeze the weights of the “other” model: it’s that you don’t update them. You may generate gradients for the other model, but you simply don’t apply them.
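To make that concrete, here is a minimal sketch (the toy models and learning rate are just stand-ins, not the assignment's code): each optimizer is handed only its own model's parameters, so its step() can never touch the other model, no matter which .grad fields were populated.

```python
import torch
from torch import nn

# Toy stand-ins so the sketch is self-contained; the real models come from
# the assignment's Generator / Discriminator classes.
gen = nn.Linear(64, 784)
disc = nn.Linear(784, 1)

# Each optimizer only knows about its own model's parameters, so its step()
# can only ever update those, regardless of what backward() produced.
gen_opt = torch.optim.Adam(gen.parameters(), lr=2e-4)
disc_opt = torch.optim.Adam(disc.parameters(), lr=2e-4)
```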
The situation is fundamentally asymmetric: when you train the discriminator, you don't depend on the generator's gradients at all, so you can detach the generator's output so that those gradients are never created. Creating them wouldn't be incorrect, since we would discard them anyway; it would just be wasted compute, so detaching saves that effort.
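As a rough sketch of where that detach() typically sits (the parameter names here are my assumption, not necessarily the assignment's exact signature):

```python
import torch

def get_disc_loss(gen, disc, criterion, real, num_images, z_dim, device):
    # Generate fake images from random noise
    noise = torch.randn(num_images, z_dim, device=device)
    fake = gen(noise)

    # detach() cuts the graph at the fake images, so backward() on this loss
    # stops at the discriminator and never walks back into the generator:
    # no generator gradients are created at all.
    disc_fake_pred = disc(fake.detach())
    disc_fake_loss = criterion(disc_fake_pred, torch.zeros_like(disc_fake_pred))

    # The real images were never produced by the generator, so nothing to detach
    disc_real_pred = disc(real)
    disc_real_loss = criterion(disc_real_pred, torch.ones_like(disc_real_pred))

    return (disc_fake_loss + disc_real_loss) / 2
```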
When you are training the generator, by contrast, you do need the gradients through the discriminator, since the generator's gradients include the discriminator's gradients as part of the Chain Rule calculation.
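And for contrast, a sketch of get_gen_loss with no detach() anywhere (same caveat about the names being my assumption):

```python
import torch

def get_gen_loss(gen, disc, criterion, num_images, z_dim, device):
    noise = torch.randn(num_images, z_dim, device=device)
    fake = gen(noise)

    # No detach() here: the loss must stay connected through the discriminator
    # so that, via the chain rule, gradients can flow all the way back into
    # the generator. This also populates .grad on the discriminator's
    # parameters, but those values are simply never applied by gen_opt.
    disc_fake_pred = disc(fake)
    return criterion(disc_fake_pred, torch.ones_like(disc_fake_pred))
```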
Here’s another thread that expresses the same ideas in slightly different wording, so it may also be worth a look.
Thanks Paul, I see. So the generator's optimizer was constructed with only gen.parameters(), so when we call step(), it will only update gen.parameters().
Yes, that's how the optimizer is defined, but it's a little more involved than that. Take a look at the detailed per-batch training logic, including the calls to the zero_grad() method and the step() method.
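Putting the pieces together as a sketch only (reusing the toy gen, disc, and optimizers from above, and assuming a dataloader, criterion, z_dim, and device exist; the real notebook has additional details):

```python
for real, _ in dataloader:
    real = real.to(device)
    cur_batch_size = len(real)

    ### Update the discriminator ###
    disc_opt.zero_grad()      # clears disc .grad, including anything left
                              # over from the previous generator update
    disc_loss = get_disc_loss(gen, disc, criterion, real,
                              cur_batch_size, z_dim, device)
    disc_loss.backward()      # gradients for disc only (generator detached)
    disc_opt.step()           # applies updates to disc.parameters() only

    ### Update the generator ###
    gen_opt.zero_grad()
    gen_loss = get_gen_loss(gen, disc, criterion,
                            cur_batch_size, z_dim, device)
    gen_loss.backward()       # gradients flow through disc into gen
    gen_opt.step()            # applies updates to gen.parameters() only
```

Notice that any discriminator gradients created during the generator's backward pass are never applied, and they get cleared by disc_opt.zero_grad() at the start of the next discriminator update, so nothing "leaks" between the two updates.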