So, I’ve been playing with this encoder/decoder for quite a while now. I think I’m getting good results on the encoding and decoding of images, but I haven’t been able to pass the assignment.
To pass it, I tried to follow the architecture given in the lab of week one, plus the recommendations given in the assignment itself. That is, I’m using an architecture with three convolutional layers, and before the latent space I use a dense layer with 1024 neurons. The problem is that I end up with a model having around 100 million parameters. So, even after reducing the batch size to 100 and using an L4 GPU (with Colab Pro), the training crashes because it exceeds the available RAM.
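For reference, here is a minimal sketch of why that happens (the 64×64×3 input shape and filter counts are my assumptions, not the graded values): if the convolutions don’t downsample much, the Dense(1024) layer right before the latent space dominates the parameter count, and `model.summary()` shows exactly where the weights live.

```python
import tensorflow as tf

# Hypothetical encoder front-end: three convolutions with no downsampling,
# then the 1024-unit dense layer before the latent space.
inputs = tf.keras.layers.Input(shape=(64, 64, 3))   # input size is an assumption
x = tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)
x = tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu')(x)
x = tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = tf.keras.layers.Flatten()(x)     # 64 * 64 * 128 = 524,288 features
x = tf.keras.layers.Dense(1024)(x)   # ~537M weights in this single layer
tf.keras.Model(inputs, x).summary()  # shows where the parameters actually live
```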
I’ve tried different architectures that use max pooling and some extra convolutional layers in the decoder, reducing the whole model to fewer than 20 million parameters, and significantly shrinking the size of the data processed in the inner layers. One of these models actually works quite well after something like 20 hours of training, but after all this time it does not pass the test on Coursera, and now it is progressing quite slowly.
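As a rough sketch of the pooling idea (again with assumed shapes): each MaxPooling2D halves the spatial size, so the feature map entering Flatten is much smaller and the dense layer shrinks with it.

```python
import tensorflow as tf

inputs = tf.keras.layers.Input(shape=(64, 64, 3))   # input size is an assumption
x = tf.keras.layers.Conv2D(32, 3, padding='same', activation='relu')(inputs)
x = tf.keras.layers.MaxPooling2D()(x)   # 32 x 32
x = tf.keras.layers.Conv2D(64, 3, padding='same', activation='relu')(x)
x = tf.keras.layers.MaxPooling2D()(x)   # 16 x 16
x = tf.keras.layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = tf.keras.layers.MaxPooling2D()(x)   # 8 x 8
x = tf.keras.layers.Flatten()(x)        # 8 * 8 * 128 = 8,192 features
x = tf.keras.layers.Dense(1024)(x)      # ~8.4M weights instead of ~537M
tf.keras.Model(inputs, x).summary()
```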
So, I need help. Below you can see the results of the encoding/decoding process.
When you work on this assignment in the Coursera environment, are you able to pass it? I mean, if you follow the instructions there, are you able to pass it?
Hi, as you can see in the image below, I don’t even get the option to use the Coursera environment. I didn’t even try, since the instructions tell me to use Colab.
I will use the TPU in Colab Pro, since I don’t have access to the Coursera environment. I wasn’t using it before because it runs quite slowly, but knowing how much time to expect does help. Thanks.
Just for more info: I was only able to train my model for the course 3 and course 4 assignments twice in a 24-hour cycle; after that, training became very slow.
Yeah, I know what you mean. I’ve been playing with several architectures, and it does help you develop some intuition about them. So I’ve found this project quite fun, even though having to wait next to the computer (so Colab doesn’t disconnect) is not ideal.
In the Sampling class code, for the epsilon step, use tf.keras.backend.random_normal, but you have used tf.random_normal (the old TensorFlow 1 name, which no longer exists in TF2).
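For context, here is a minimal sketch of the Sampling layer that correction refers to, assuming the standard reparameterization-trick layout from the labs (names and shapes are illustrative):

```python
import tensorflow as tf

class Sampling(tf.keras.layers.Layer):
    """Reparameterization trick: draw z from N(mu, sigma) via epsilon ~ N(0, 1)."""
    def call(self, inputs):
        mu, sigma = inputs
        batch = tf.shape(mu)[0]
        dim = tf.shape(mu)[1]
        # the correction: tf.keras.backend.random_normal, not tf.random_normal
        epsilon = tf.keras.backend.random_normal(shape=(batch, dim))
        return mu + tf.exp(0.5 * sigma) * epsilon
```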
In the encoder layers function, you aren’t supposed to use any activation for the mu and sigma steps. Also, while it isn’t incorrect code as such, I notice you have not named each of the layers, which could be an issue with the autograder. So correct those steps by referring to the updated labs for how to name each layer in this code cell.
In the encoder model code cell, for the Model step, give the outputs as mu, sigma, z (in that sequence).
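Putting those two points together, a hedged sketch might look like the following; the layer names, sizes, and input shape are placeholders, not the graded values:

```python
import tensorflow as tf

latent_dim = 2  # placeholder; use the assignment's value

inputs = tf.keras.layers.Input(shape=(64, 64, 3), name='encoder_input')
x = tf.keras.layers.Conv2D(32, 3, strides=2, padding='same',
                           activation='relu', name='encode_conv1')(inputs)
x = tf.keras.layers.Flatten(name='encode_flatten')(x)

# no activation on the mu and sigma layers, and every layer gets a name
mu = tf.keras.layers.Dense(latent_dim, name='latent_mu')(x)
sigma = tf.keras.layers.Dense(latent_dim, name='latent_sigma')(x)
z = Sampling()([mu, sigma])  # the Sampling layer sketched above

# outputs in the expected sequence: mu, sigma, z
encoder = tf.keras.Model(inputs=inputs, outputs=[mu, sigma, z], name='encoder')
```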
In the vae_model function, for the inputs you are supposed to use tf.keras.layers.Input, not tf.keras.Input.
In the vae_model function, your step mu, sigma, z = encoder(inputs) is incorrect. You need to define mu, sigma, and z separately, indexing the output of encoder(inputs): mu is [0], sigma is [1], and z is [2], as sketched below.
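A minimal sketch of those two fixes together (the decoder and the loss terms are omitted, and the function signature is an assumption, not the graded one):

```python
import tensorflow as tf

def vae_model(encoder, decoder, input_shape):
    inputs = tf.keras.layers.Input(shape=input_shape)  # not tf.keras.Input
    # index the encoder outputs instead of tuple-unpacking them
    enc_outputs = encoder(inputs)
    mu = enc_outputs[0]
    sigma = enc_outputs[1]
    z = enc_outputs[2]
    reconstructed = decoder(z)
    model = tf.keras.Model(inputs=inputs, outputs=reconstructed)
    # the assignment also attaches the KL-divergence loss here; omitted in this sketch
    return model
```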
You haven’t mentioned how many epochs you trained your model for; please include that. As far as I know, it should not require 100 epochs. You can use a smaller number: start from 50, and if that doesn’t give the desired result, try 60 or 70 epochs.
After making these corrections, if you still see a submission failure or excessive training time, then use the fallback session link mentioned in the previous comment on your post.
I just changed the order of the encoder outputs to match what you told me. I had been careful to keep the same order when I used the outputs in the vae model, though. So, out of curiosity, is this order important only for the grader, or does it affect the inner workings of the model?
I restarted the training, but now I don’t have any available compute units, so I had to use the CPU. I guess tomorrow I will be able to use a GPU again. For now I applied the fallback to see if it prevents the runtime from collapsing. It is quite slow; in 30 minutes it has advanced only 1.5 epochs (with a sample size of 500), but it seems to be stable. I’ll let you know later how it goes.
It finally worked, and I was able to finish. With the fallback, the RAM usage stayed stable and the loss went down rather fast. I trained for about an hour and a half, with a sample size of 500 and the 15 GB GPU.