Run time of C3W2: SRGAN (Optional) lab

Hello!

I am running the optional lab, C3W2: SRGAN, in Colab. The cell below has been running for about four hours and still hasn’t finished. Any thoughts?

Best,
Saif.

device = 'cuda' if torch.cuda.is_available() else 'cpu'
generator = Generator(n_res_blocks=16, n_ps_blocks=2)

# Uncomment the following lines if you're using ImageNet
# dataloader = torch.utils.data.DataLoader(
#     Dataset('data', 'train', download=True, hr_size=[384, 384], lr_size=[96, 96]),
#     batch_size=16, pin_memory=True, shuffle=True,
# )
# train_srresnet(generator, dataloader, device, lr=1e-4, total_steps=1e6, display_step=500)
# torch.save(generator, 'srresnet.pt')

# Uncomment the following lines if you're using STL
dataloader = torch.utils.data.DataLoader(
    Dataset('data', 'train', download=True, hr_size=[96, 96], lr_size=[24, 24]),
    batch_size=16, pin_memory=True, shuffle=True,
)
train_srresnet(generator, dataloader, device, lr=1e-4, total_steps=1e5, display_step=1000)
torch.save(generator, 'srresnet.pt')

7 hours of running but still not completed. I am shutting down my computer. Good night :sleeping: :sleeping:

Hi @saifkhanengr, that training IS slow, but I just gave it a try and it was faster for me than what you’re seeing. I ran that cell for about an hour and it had gotten through 50,000 steps, so halfway through the total. That would mean about 2 hours for the full 100,000 steps, unless for some reason it slowed way down towards the end. I was using Colab.
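If you want to sanity-check the run time before committing to the full run, here is a small helper I sometimes use (my own sketch, not part of the lab): time a handful of steps and extrapolate. The `step_fn` argument is a stand-in for whatever does one training step in your setup.

```python
import time

# Rough sketch (not from the lab): time a few training steps to
# estimate the full run before committing to 100,000 steps.
def estimate_run_hours(step_fn, n_probe=20, total_steps=100_000):
    start = time.perf_counter()
    for _ in range(n_probe):
        step_fn()  # one optimizer step on one batch
    per_step = (time.perf_counter() - start) / n_probe
    return per_step * total_steps / 3600.0
```

On a GPU you’d expect the estimate to come out around the 2-hour mark; if it comes out near 7+ hours, that’s a strong hint you’re on the CPU.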

If you were also using Colab, the only thing I can think of is that you somehow ended up with device = 'cpu'. Maybe try changing the first line of that cell so that it raises an error if torch.cuda.is_available() is False, just as a way to make doubly sure you’re running on the GPU.

One other suggestion would be to reduce the total_steps in the call to train_srresnet to something smaller. The model should be good enough for experimentation after 20,000 steps or so. Then you’ll at least have something to use to try out the next part, the SRGAN itself.

FYI, I noticed this old post about this optional lab. As far as I can tell, the adversarial loss calculations still look suspicious, since both the fake loss and the real loss are checking for fake predictions to be false:

g_loss calls: self.adv_loss(fake_preds_for_g, False)
...
d_loss calls: self.adv_loss(fake_preds_for_d, False)

Just something to be aware of when you get to the SRGAN part of the lab. If your results aren’t looking great, you may want to experiment with adjusting it so that the discriminator is trying to predict false for fakes and the generator is trying to get the discriminator to predict true for fakes. (And if you do find you need to change something for this, please post back here so I can ping the developers to remind them that they still need to take a look at this.)
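For reference, here is a minimal sketch of what the targets would conventionally look like, assuming `adv_loss` is a BCE-style loss that takes predictions and a boolean target (the function names mirror the lab’s, but the implementation here is my assumption, not the lab’s code):

```python
import torch
import torch.nn.functional as F

def adv_loss(preds, is_real: bool):
    # BCE-with-logits against an all-ones (real) or all-zeros (fake) target.
    # Assumed signature, mirroring the lab's adv_loss(preds, bool) calls.
    target = torch.ones_like(preds) if is_real else torch.zeros_like(preds)
    return F.binary_cross_entropy_with_logits(preds, target)

def d_adv_loss(real_preds, fake_preds):
    # Discriminator: reals should be classified real (True), fakes fake (False).
    return adv_loss(real_preds, True) + adv_loss(fake_preds, False)

def g_adv_loss(fake_preds_for_g):
    # Generator: wants the discriminator to call its fakes real (True) --
    # this is the call that looks wrong in the lab, which passes False here.
    return adv_loss(fake_preds_for_g, True)
```

In other words, the d_loss call with False looks fine; it’s the g_loss call passing False that seems backwards.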

Hello Wendy! Thank you for the reply.

I did this: I commented out the first line of code and added my own code (below).

#device = 'cuda' if torch.cuda.is_available() else 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
else:
    print("CUDA is not available. Exiting.")
    exit()

It executed and took about 3 hours to complete the 100,000 steps.

Good suggestion.


Personally, this long training time demotivates me from exploring or playing with this notebook any further. Bye to this notebook! But I highly appreciate your suggestions…

I agree about the long time being demotivating!

I’ll submit a suggestion to the developers to load a pre-trained model for all or most of this first part (the SRResNet training). It’s really the SRGAN training after this that is the focus of this lab anyway.
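In the meantime, anyone who has already paid the 3 hours once can skip retraining by saving and reloading. The lab’s cell saves the whole module with torch.save(generator, 'srresnet.pt'); the state_dict round-trip shown below is the more portable pattern. This is a generic sketch (nn.Linear stands in for the lab’s Generator, and the filename is just an example):

```python
import torch
import torch.nn as nn

# Stand-in for Generator(n_res_blocks=16, n_ps_blocks=2) from the lab.
model = nn.Linear(4, 2)

# Save only the weights (state_dict), which is robust to code changes.
torch.save(model.state_dict(), 'srresnet_state.pt')

# To reload: rebuild the same architecture, then load the weights into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load('srresnet_state.pt'))
```

With a saved checkpoint, you can jump straight to the SRGAN section on a fresh Colab session instead of rerunning the SRResNet training.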