Hey @GuyMerf,
In my opinion, the error is in the inputs to the criterion function, which is why the outputs have mismatched dimensions as well. The criterion is essentially computing the loss from the predicted labels and the ground-truth labels.

Here, the predicted labels are predict_fake and predict_real. As for the ground-truth labels, we want all the predictions for fake images to be 0s, so the true label for fake images can be torch.zeros_like(predict_fake). The best part is that you don't have to worry about the dimensions at all. Similarly, for the real images, the true label can be torch.ones_like(predict_real).
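A minimal sketch of this, assuming the criterion is nn.BCEWithLogitsLoss and using random tensors to stand in for the discriminator's outputs:

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

# Hypothetical discriminator outputs (logits), one per image in the batch
predict_fake = torch.randn(16, 1)
predict_real = torch.randn(16, 1)

# Ground-truth labels: 0s for fakes, 1s for reals.
# *_like matches the shape, dtype, and device of the predictions,
# so the dimensions always line up automatically.
disc_fake_loss = criterion(predict_fake, torch.zeros_like(predict_fake))
disc_real_loss = criterion(predict_real, torch.ones_like(predict_real))
disc_loss = (disc_fake_loss + disc_real_loss) / 2
```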

Please @Elemento, why do we have to detach the generator to calculate the disc loss, but not for the generator's loss? I'm not sure I understand that part.

The reason here is very simple. First, allow me to state a trivial fact: we compute the generator loss to update the weights of the generator, and the discriminator loss to update the weights of the discriminator.

Once we have established this fact, the question answers itself. When we calculate the disc loss, we only want to update the discriminator, so we use a tensor that is detached from the computation graph (which is exactly what the detach method does). But when we calculate the generator's loss, we want to update the generator, and if you use the detach method at that point, the generator's weights won't update and the generator won't train.

In summary, we use the detach method when we want to make sure that the generator's weights are not updated.
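The discriminator step can be sketched like this, using tiny linear layers as stand-ins for the course's generator and discriminator (these stand-ins are my assumption, not the assignment's architecture):

```python
import torch
import torch.nn as nn

# Toy stand-ins for the generator and discriminator
gen = nn.Linear(8, 4)
disc = nn.Linear(4, 1)
criterion = nn.BCEWithLogitsLoss()

fake = gen(torch.randn(16, 8))
# detach() cuts the fakes out of the computation graph, so this
# backward pass cannot touch (or even compute) generator gradients
predict_fake = disc(fake.detach())
disc_loss = criterion(predict_fake, torch.zeros_like(predict_fake))
disc_loss.backward()

assert gen.weight.grad is None        # generator's weights are untouched
assert disc.weight.grad is not None   # discriminator gets its gradients
```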

I think it's worth going into a little more detail here. Here's another thread that discusses this issue.

Note that the situation is fundamentally asymmetric:

When we train the generator, we need the gradients for the discriminator, since the loss is defined by the output of the discriminator, right? So by the Chain Rule, the generator gradients contain the discriminator gradients as factors. But then we are careful not to apply those gradients to the discriminator: we only apply the gradients for the generator in that case. Then we always discard any previous gradients at the beginning of any training cycle.
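The generator side of that asymmetry can be sketched as follows (again with toy linear layers as assumed stand-ins): the backward pass computes discriminator gradients as Chain Rule factors, but the optimizer holds only the generator's parameters, so only the generator is updated.

```python
import torch
import torch.nn as nn

gen = nn.Linear(8, 4)
disc = nn.Linear(4, 1)
criterion = nn.BCEWithLogitsLoss()
# The optimizer holds ONLY the generator's parameters
gen_opt = torch.optim.SGD(gen.parameters(), lr=0.01)

gen_opt.zero_grad()  # discard any previous gradients at the start of the cycle
predict_fake = disc(gen(torch.randn(16, 8)))
gen_loss = criterion(predict_fake, torch.ones_like(predict_fake))
gen_loss.backward()

# Chain Rule: discriminator gradients were computed as factors...
assert disc.weight.grad is not None
# ...but step() applies gradients to the generator's parameters only
gen_opt.step()
```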

In the case of training the discriminator, the gradients do not include the generator gradients, so we literally don't need them. We could compute them and they would just be thrown away, so it's not a correctness issue. It's a performance issue: backpropagating gradients costs compute and memory, so why do it in a case where you know you don't need them? Why waste the CPU and memory when you're just going to throw the gradients away?
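You can see that autograd really does skip the generator's subgraph in this case: a detached tensor carries no graph history, so the backward pass never descends into the generator at all (toy linear layer as an assumed stand-in).

```python
import torch
import torch.nn as nn

gen = nn.Linear(8, 4)
fake = gen(torch.randn(16, 8))

# The generator's output carries graph history...
assert fake.requires_grad
# ...but its detached view does not, so autograd skips the entire
# generator subgraph during the discriminator's backward pass
assert not fake.detach().requires_grad
```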