So, I’ve been playing with this encoder/decoder for quite a while now. I think I’m getting good results on the encoding and decoding of images, but I haven’t been able to pass the assignment.
To pass it, I tried to follow the architecture given in the lab of week one, plus the recommendations given in the assignment itself. That is, I’m using an architecture with three convolutional layers, and before the latent space I use a dense layer with 1024 neurons. The problem is that I end up with a model having around 100 million parameters. So, even after reducing the batch size to 100 and using an L4 GPU (with Colab Pro), the training crashes because it exceeds the available RAM.
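For reference, here is a minimal sketch of why that happens (the 64×64×3 input shape and filter counts are my assumptions, not the graded values): if the convolutions don’t downsample much, the Dense(1024) layer right before the latent space dominates the parameter count, and `model.summary()` shows exactly where the weights live.

```python
import tensorflow as tf

# Hypothetical encoder front-end: three convolutions with no downsampling,
# then the 1024-unit dense layer before the latent space.
inputs = tf.keras.layers.Input(shape=(64, 64, 3))   # input size is an assumption
x = tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)
x = tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu')(x)
x = tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = tf.keras.layers.Flatten()(x)     # 64 * 64 * 128 = 524,288 features
x = tf.keras.layers.Dense(1024)(x)   # ~537M weights in this single layer
tf.keras.Model(inputs, x).summary()  # shows where the parameters actually live
```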
I’ve tried different architectures that use max pooling and some extra convolutional layers in the decoder, reducing the whole model to fewer than 20 million parameters, and significantly shrinking the size of the data processed in the inner layers. One of these models actually works quite well after something like 20 hours of training, but after all this time it does not pass the test on Coursera, and now it is progressing quite slowly.
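As a rough sketch of the pooling idea (again with assumed shapes): each MaxPooling2D halves the spatial size, so the feature map entering Flatten is much smaller and the dense layer shrinks with it.

```python
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(64, 64, 3))   # input size is an assumption
x = tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)
x = tf.keras.layers.MaxPooling2D()(x)   # 32 x 32
x = tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu')(x)
x = tf.keras.layers.MaxPooling2D()(x)   # 16 x 16
x = tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = tf.keras.layers.MaxPooling2D()(x)   # 8 x 8
x = tf.keras.layers.Flatten()(x)        # 8 * 8 * 128 = 8,192 features
x = tf.keras.layers.Dense(1024)(x)      # ~8.4M weights instead of ~537M
tf.keras.Model(inputs, x).summary()
```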
So, I need help. Below you can see the results of the encoding/decoding process.
When you work on this assignment in the Coursera environment, are you able to pass it? I mean, if you follow the instructions there, are you able to pass it?
Hi, as you can see in the image below, I don’t even get the option to use the Coursera environment. I didn’t even try, since the instructions tell me to use Colab.
I will use the TPU in Colab Pro, since I don’t have access to the Coursera environment. I wasn’t using it before because it runs quite slowly, but knowing how much time to expect does help. Thanks.
Just for more info: I was only able to train my model for the course 3 and course 4 assignments twice in a 24-hour cycle; after that, training became very slow.
Yeah, I know what you mean. I’ve been playing with several architectures, and it does help you develop some intuition about them. So I’ve found this project quite fun, even though having to wait next to the computer (so Colab doesn’t disconnect) is not ideal.
In the Sampling class code, for the epsilon step, use tf.keras.backend.random_normal, but you have used tf.random_normal (the old TensorFlow 1 name, which no longer exists in TF2).
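For context, here is a minimal sketch of the Sampling layer that correction refers to, assuming the standard reparameterization-trick layout from the labs (names and shapes are illustrative):

```python
import tensorflow as tf

class Sampling(tf.keras.layers.Layer):
    """Reparameterization trick: draw z from N(mu, sigma) via epsilon ~ N(0, 1)."""
    def call(self, inputs):
        mu, sigma = inputs
        batch = tf.shape(mu)[0]
        dim = tf.shape(mu)[1]
        # the correction: tf.keras.backend.random_normal, not tf.random_normal
        epsilon = tf.keras.backend.random_normal(shape=(batch, dim))
        return mu + tf.exp(0.5 * sigma) * epsilon
```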
In the encoder layers function, you aren’t supposed to use any activation for the mu and sigma steps. Also, while it isn’t incorrect code as such, I notice you have not named each of the layers, which could be an issue with the autograder. So correct those steps by referring to the updated labs for how to name each layer in this code cell.
In the encoder model code cell, for the Model step, give the outputs as mu, sigma, z (in that sequence).
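Putting those two points together, a hedged sketch might look like the following; the layer names, sizes, and input shape are placeholders, not the graded values:

```python
import tensorflow as tf

latent_dim = 2  # placeholder; use the assignment's value

inputs = tf.keras.layers.Input(shape=(64, 64, 3), name='encoder_input')
x = tf.keras.layers.Conv2D(32, 3, strides=2, padding='same',
                           activation='relu', name='encode_conv1')(inputs)
x = tf.keras.layers.Flatten(name='encode_flatten')(x)

# no activation on the mu and sigma layers, and every layer gets a name
mu = tf.keras.layers.Dense(latent_dim, name='latent_mu')(x)
sigma = tf.keras.layers.Dense(latent_dim, name='latent_sigma')(x)
z = Sampling()([mu, sigma])  # the Sampling layer sketched above

# outputs in the expected sequence: mu, sigma, z
encoder = tf.keras.Model(inputs=inputs, outputs=[mu, sigma, z], name='encoder')
```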
In the vae_model function, for the inputs you are supposed to use tf.keras.layers.Input, not tf.keras.Input.
In the vae_model function, your step mu, sigma, z = encoder(inputs) is incorrect. You need to define mu, sigma, and z separately, indexing the output of encoder(inputs): mu is [0], sigma is [1], and z is [2], as sketched below.
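A minimal sketch of those two fixes together (the decoder and the loss terms are omitted, and the function signature is an assumption, not the graded one):

```python
import tensorflow as tf

def vae_model(encoder, decoder, input_shape):
    inputs = tf.keras.layers.Input(shape=input_shape)  # not tf.keras.Input
    # index the encoder outputs instead of tuple-unpacking them
    enc_outputs = encoder(inputs)
    mu = enc_outputs[0]
    sigma = enc_outputs[1]
    z = enc_outputs[2]
    reconstructed = decoder(z)
    model = tf.keras.Model(inputs=inputs, outputs=reconstructed)
    # the assignment also attaches the KL-divergence loss here; omitted in this sketch
    return model
```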
You haven’t mentioned how many epochs you trained your model for; please include that. As far as I know, it should not require 100 epochs. You can use a smaller number: start from 50, and if that doesn’t give the desired result, try 60 or 70 epochs.
After making these corrections, if you still see a submission failure or excessive training time, then use the fallback session link mentioned in the previous comment on your post.
I just changed the order of the encoder outputs to match what you told me. I had been careful to keep the same order when I used the outputs in the vae model, though. So, out of curiosity, is this order important only for the grader, or does it affect the inner workings of the model?
I restarted the training, but now I don’t have any available compute units, so I had to use the CPU. I guess tomorrow I will be able to use a GPU again. For now I applied the fallback to see if it prevents the runtime from collapsing. It is quite slow; in 30 minutes it has advanced only 1.5 epochs (with a sample size of 500), but it seems to be stable. I’ll let you know later how it goes.
It finally worked, and I was able to finish. With the fallback, the RAM usage stayed stable and the loss went down rather fast. I trained for about an hour and a half, with a sample size of 500 and the 15 GB GPU.