Requesting clarification on Image Generation


About Week 2’s programming assignment: how does the final layer, a deconvolution (transposed convolution) with output channels = 1, input channels = 64, kernel size = 4, and stride = 2, convert its input into a 28x28 image of shape [128, 1, 28, 28]?

Hi, @Joshua_Siy!
I assume that the essence of your question is “how did a latent noise tensor of size [w,h] result in [28,28] output?” (please correct me if this isn’t what you’re asking).
Firstly, in the screenshot above I’ve printed Generator’s layer summary. The output size of each of the layers is determined by the formula [(W−K+2P)/S]+1. Please use this link to experiment with different settings of input tensor size and conv layer parameters.
Secondly, I’ve printed tensor sizes after every conv layer in the generator. If you try to compute tensor sizes using the formula above you should arrive at the final tensor size - [n_samples, 1, 28, 28].
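As a small sketch of the formula quoted above: read literally, [(W−K+2P)/S]+1 (with the division rounded down) gives the output size of a regular convolution, which shrinks its input. Applied backwards from the 28x28 image, it recovers the size that each generator layer started from. The function name is illustrative, padding is assumed to be 0, and the (kernel, stride) pairs are the ones discussed in this thread.

```python
def conv_output_size(w: int, k: int, p: int = 0, s: int = 1) -> int:
    """Regular convolution output size: floor((W - K + 2P) / S) + 1."""
    return (w - k + 2 * p) // s + 1


# (kernel, stride) of each generator layer, listed from the last layer
# back to the first. Padding is assumed to be 0 everywhere.
layers_from_output = [(4, 2), (3, 2), (4, 1), (3, 2)]

size = 28
sizes = [size]
for kernel, stride in layers_from_output:
    size = conv_output_size(size, kernel, p=0, s=stride)
    sizes.append(size)

print(sizes)
```

Going in the forward (upsampling) direction needs the inverse relation, which is the transposed-convolution formula.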
Please let me know if my reply wasn’t helpful or you have any follow up questions.

Kind regards,

Thank you for your response. I do have another question, about the first layer: how did the input noise vector [n_samples, z_dim] become the output [100, 256, 3, 3]? Using the simulator provided, I’m not sure what the input should be, so I’ll just paste my result for the first layer from the simulator. Please advise, as I am unsure of what I’m doing.

I think I’m getting a hint: the noise vector in full is [100, 64, 1, 1], because if I upsample a 1x1 image using a kernel of 3 and a stride of 2, it results in 3x3, correct?

I just found the answer after looking at the documentation for ConvTranspose2d.

Just writing down what I understood.
The formula for the output height and width, in short (excluding the sub-expressions that reduce to zero), is:
Output size = (Input size - 1) * stride - 2 * padding + kernel size
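To make the "terms that drop out" explicit, here is a minimal sketch of the full output-size expression as given in the PyTorch documentation for nn.ConvTranspose2d, next to the short form above. With the defaults dilation = 1 and output_padding = 0 the two agree. The function names are illustrative and not part of any library.

```python
def conv_transpose_output_size(size_in, kernel_size, stride=1, padding=0,
                               dilation=1, output_padding=0):
    """Full output-size expression for a transposed convolution."""
    return ((size_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)


def short_form(size_in, kernel_size, stride=1, padding=0):
    """Short form, valid when dilation == 1 and output_padding == 0."""
    return (size_in - 1) * stride - 2 * padding + kernel_size


# Both versions agree for a 1x1 input, kernel 3, stride 2, no padding.
print(conv_transpose_output_size(1, 3, stride=2), short_form(1, 3, stride=2))
```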

Given the input of batch size and channels, (100, 64):

We deconvolve this with a layer with a kernel of 3 and a stride of 2. However, I’m not sure what the image size is, so I just assumed it is 1x1.

1x1 with a kernel of 3 and a stride of 2 results in 3x3.

3x3 with a kernel of 4 and a stride of 1 results in 6x6.

6x6 with a kernel of 3 and a stride of 2 results in 13x13.

13x13 with a kernel of 4 and a stride of 2 results in 28x28.
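The four steps above can be checked with a short script. This is only a sketch: it assumes zero padding in every layer and a 1x1 starting size, and the helper name is made up for illustration.

```python
def upsampled_size(size_in, kernel, stride, padding=0):
    """Output size = (input size - 1) * stride - 2 * padding + kernel size."""
    return (size_in - 1) * stride - 2 * padding + kernel


# (kernel, stride) for each of the four layers, in forward order.
layers = [(3, 2), (4, 1), (3, 2), (4, 2)]

size = 1  # assumed 1x1 spatial size of the noise input
trace = [size]
for kernel, stride in layers:
    size = upsampled_size(size, kernel, stride)
    trace.append(size)

print(" -> ".join(f"{s}x{s}" for s in trace))
# prints: 1x1 -> 3x3 -> 6x6 -> 13x13 -> 28x28
```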

Please correct me if I’m wrong. Thanks.

The initial dimension of the noise vector is [n_samples, z_dim] → [100, 10]. In the forward() method of the Generator class, this vector gets unsqueezed to [100, 10, 1, 1]; this 4-dimensional array can be treated as [batch_size, n_channels, w, h]. So you’re correct in your assumption that the “image” dimension of the noise is 1x1. I only struggle to understand where the 64 is coming from.
(64 is the output channel dimension of the second generator block: it takes in 10 channels and produces 64.)

The rest is correct :+1: