RuntimeError in C3W2B during training in last cell

Hello again,

I have an issue running the C3W2B assignment. I can’t run the last cell and I get the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [11], in <cell line: 67>()
     61                     torch.save({'gen': gen.state_dict(),
     62                         'gen_opt': gen_opt.state_dict(),
     63                         'disc': disc.state_dict(),
     64                         'disc_opt': disc_opt.state_dict()
     65                     }, f"pix2pix_{cur_step}.pth")
     66             cur_step += 1
---> 67 train()

Input In [11], in train(save_model)
     37 ### Update generator ###
     38 gen_opt.zero_grad()
---> 39 gen_loss = get_gen_loss(gen, disc, real, condition, adv_criterion, recon_criterion, lambda_recon)
     40 gen_loss.backward() # Update gradients
     41 gen_opt.step() # Update optimizer

Input In [8], in get_gen_loss(gen, disc, real, condition, adv_criterion, recon_criterion, lambda_recon)
     25 fake = gen(condition)
     26 fake_pred = disc(fake,condition)
---> 27 adv_loss = adv_criterion(fake_pred, torch.ones(fake_pred.shape))
     28 recon_loss = recon_criterion(fake, real)
     29 gen_loss = adv_loss + lambda_recon*recon_loss

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
   1190 # If we don't have any hooks, we want to skip the rest of the logic in
   1191 # this function, and just call forward.
   1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1193         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194     return forward_call(*input, **kwargs)
   1195 # Do not call functions when jit is used
   1196 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.8/dist-packages/torch/nn/modules/loss.py:720, in BCEWithLogitsLoss.forward(self, input, target)
    719 def forward(self, input: Tensor, target: Tensor) -> Tensor:
--> 720     return F.binary_cross_entropy_with_logits(input, target,
    721                                               self.weight,
    722                                               pos_weight=self.pos_weight,
    723                                               reduction=self.reduction)

File /usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:3162, in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
   3159 if not (target.size() == input.size()):
   3160     raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
-> 3162 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

When I check the values of the variables device, recon_criterion and adv_criterion, I get cuda, L1Loss() and BCEWithLogitsLoss() respectively, so this looks fine to me.
I think my code is fine since the grader gives it full marks, so maybe it’s a bug in the notebook?

But the tests in the notebook may not catch everything, in particular “hard-coding” type errors that happen to match the notebook environment. You used torch.ones without specifying a device, so you get a CPU tensor. That works fine as long as the input tensor is also on the CPU, but it fails when the input is on the GPU and you then try to do computations involving both tensors. The better way to handle this is to use:

torch.ones_like(fake_pred)

which automatically creates the new tensor on the same device (and with the same dtype) as fake_pred, in addition to matching its shape.
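
For reference, here is a minimal sketch of what the corrected generator loss could look like, assuming the rest of your get_gen_loss matches the lines shown in the traceback above:

import torch

def get_gen_loss(gen, disc, real, condition, adv_criterion, recon_criterion, lambda_recon):
    # Generate a fake image from the condition
    fake = gen(condition)
    # Discriminator prediction on the fake, conditioned on the input
    fake_pred = disc(fake, condition)
    # torch.ones_like matches the shape, device, and dtype of fake_pred,
    # so this works whether fake_pred is on the CPU or the GPU
    adv_loss = adv_criterion(fake_pred, torch.ones_like(fake_pred))
    # Reconstruction loss between the fake and the real target image
    recon_loss = recon_criterion(fake, real)
    gen_loss = adv_loss + lambda_recon * recon_loss
    return gen_loss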

The reason the grader doesn’t fail is that it runs only your graded functions, and it runs them only on the CPU to save money.

So the bottom line is that this is a bug in your code, but one that happens not to be caught by either the test cases in the notebook or by the grader. Perhaps we should consider that a gap in the test cases, although it’s unlikely they will want to change the grader to run on the GPU, for cost reasons.


Thanks for the quick reply!
Using ones_like solved the issue, as you said. I didn’t think about specifying the device for torch.ones.
Thanks for helping and explaining it to me 🙂