Week 1 Assignment: RuntimeError

Hi,
Please help, I am stuck and not able to find the issue here.

    noise_vector = get_noise(num_images, z_dim, device)
    fake = gen(noise_vector).detach()
    predict_fake = disc(fake)
    loss_fake = criterion(predict_fake, torch.zeros(num_images, z_dim))
    predict_real = disc(real)
    loss_real = criterion(predict_real, real)
    disc_loss = (loss_fake + loss_real)/2.0

Here is the error. I am not familiar with PyTorch.

RuntimeError                              Traceback (most recent call last)
<ipython-input-45-97fd1ea584a3> in <module>
     71             break
     72 
---> 73 test_disc_reasonable()
     74 test_disc_loss()
     75 print("Success!")

<ipython-input-45-97fd1ea584a3> in test_disc_reasonable(num_images)
     23     criterion = torch.mul # Multiply
     24     real = torch.zeros(num_images, 10)
---> 25     assert torch.all(torch.abs(get_disc_loss(gen, disc, criterion, real, num_images, z_dim, 'cpu').mean() - 5) < 1e-5)
     26 
     27     gen = torch.ones_like

<ipython-input-44-1c34c7deaebb> in get_disc_loss(gen, disc, criterion, real, num_images, z_dim, device)
     37     predict_real = disc(real)
     38     loss_real = criterion(predict_real, real)
---> 39     disc_loss = (loss_fake + loss_real)/2.0
     40     #### END CODE HERE ####
     41     return disc_loss

RuntimeError: The size of tensor a (64) must match the size of tensor b (10) at non-singleton dimension 1

Hey @GuyMerf,
In my opinion, the error is in the inputs to the criterion function, which is why the outputs have mismatched dimensions as well. The criterion essentially computes the loss from the predicted labels and the ground-truth labels.

Here, the predicted labels are predict_fake and predict_real. Now, if we think about the ground-truth labels, we want all the predictions corresponding to fake images to be 0s, so the true label for fake images can be torch.zeros_like(predict_fake). The best part is that you don’t have to worry about the dimensions at all. Similarly, for the real images, the true label can be torch.ones_like(predict_real).

In summary, try out this code:

    predict_fake = disc(fake)
    loss_fake = criterion(predict_fake, torch.zeros_like(predict_fake))
    predict_real = disc(real)
    loss_real = criterion(predict_real, torch.ones_like(predict_real))
    disc_loss = (loss_fake + loss_real) / 2.0
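To see why the `_like` constructors avoid the RuntimeError, here is a minimal, self-contained sketch (the toy shapes and the BCEWithLogitsLoss criterion are my own assumptions, not from the assignment):

```python
import torch

# A toy discriminator output: a batch of 4 predictions, 1 logit each.
predict_fake = torch.randn(4, 1)

# torch.zeros_like builds a target with the same shape, dtype, and device
# as the predictions, so the criterion never sees a shape mismatch.
target_fake = torch.zeros_like(predict_fake)
assert target_fake.shape == predict_fake.shape

criterion = torch.nn.BCEWithLogitsLoss()
loss_fake = criterion(predict_fake, target_fake)  # scalar loss, no RuntimeError
```

In contrast, hard-coding a shape like `torch.zeros(num_images, z_dim)` only works if it happens to match the discriminator's output shape, which is exactly what broke above.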

Hi @Elemento,
Thanks, it works! I was not getting to the root of that part, and you have explained it well.
Thank you once again!


Happy to help @GuyMerf :innocent:


Please @Elemento, why do we have to detach the generator's output to calculate the discriminator loss, but not for the generator's loss? I am not sure I understand that part.


The reason here is very simple. First, allow me to state a very trivial fact: we compute the generator loss to update the weights of the generator, and the discriminator loss to update the weights of the discriminator.

Once we have established this fact, the question answers itself. When we calculate the discriminator loss, we want to update only the discriminator, and hence we use a tensor that is not attached to the computation graph (which is exactly the job of the detach method). But when we calculate the generator's loss, we want to update the generator, and if you use the detach method at that point, the generator's weights won't update and the generator won't train.

In summary, we use the detach method when we want to make sure that the generator's weights are not updated.

Regards,
Elemento
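The effect of detach can be verified directly. Here is a minimal sketch with hypothetical one-layer "networks" (the Linear layers and shapes are stand-ins, not the assignment's architecture):

```python
import torch
from torch import nn

# Hypothetical one-layer generator and discriminator, just for illustration.
gen = nn.Linear(2, 3)
disc = nn.Linear(3, 1)

noise = torch.randn(5, 2)
fake = gen(noise).detach()      # cut the graph at the generator's output
disc_loss = disc(fake).mean()
disc_loss.backward()

# Only the discriminator received gradients; the generator got none.
assert all(p.grad is not None for p in disc.parameters())
assert all(p.grad is None for p in gen.parameters())
```

So stepping a discriminator optimizer after this backward pass can never touch the generator's weights.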


I think it’s worth going into a little more detail there. Here’s another thread that discusses this issue.

Note that the situation is fundamentally asymmetric:

When we train the generator, we need the gradients for the discriminator, since the loss is defined by the output of the discriminator, right? So by the Chain Rule, the generator gradients contain the discriminator gradients as factors. But then we are careful not to apply those gradients to the discriminator: we only apply the gradients for the generator in that case. Then we always discard any previous gradients at the beginning of any training cycle.

In the case of training the discriminator, the gradients do not include the generator gradients, so we literally don’t need them. We could compute them and they would just be thrown away, so it’s not a correctness issue. It’s a performance issue: computing gradients is expensive (backpropagating through the whole graph), so why do it in a case where you know you don’t need them? Why waste the CPU and memory when you’re just going to throw the gradients away?
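The asymmetry described above can be sketched concretely. In this hypothetical example (toy Linear layers, SGD, all my own assumptions), the generator loss is computed without detach, so backward() fills gradients for both networks via the chain rule, but only the generator's optimizer is stepped:

```python
import torch
from torch import nn

gen = nn.Linear(2, 3)
disc = nn.Linear(3, 1)
opt_gen = torch.optim.SGD(gen.parameters(), lr=0.1)

noise = torch.randn(5, 2)
# No detach here: the generator loss is defined by the discriminator's
# output, so by the chain rule backward() computes gradients for BOTH.
gen_loss = disc(gen(noise)).mean()
gen_loss.backward()
assert all(p.grad is not None for p in gen.parameters())
assert all(p.grad is not None for p in disc.parameters())

# ...but we are careful to step only the generator's optimizer, so the
# discriminator's weights stay untouched, and we discard stale gradients
# at the start of the next cycle.
opt_gen.step()
opt_gen.zero_grad()
```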


Thanks a lot, @paulinpaloalto. I also learned new things from your answer :innocent: