Understanding sizes in GANs

Hi everyone,

I am currently toying around with WGAN-GP, trying to build a simple GAN that fits my needs.

Obviously, my inputs do not fit directly into the GAN, and I am looking at the building blocks, trying to understand the dimension changes in the convolution layers.

My real data is a tensor of shape (50,128). I can make it (1,50,128) so we know it has only one channel.

When you look at the code to create a generator (or discriminator block) we have this:

def make_gen_block(self, input_channels, output_channels, kernel_size=3, stride=2, final_layer=False):

And maybe I'm confused about the meaning of channels here, because I thought it was like RGB, or whatever the 3rd dimension of the data is.

Could you explain to me how the height and width of the image come into play? I understand the convolution operation itself, with kernels etc., but it's the parameters and how the data is going to be "processed" that I struggle with.

Thank you very much


Hi @Barb ,

The parameters input_channels and output_channels in make_gen_block are passed to nn.ConvTranspose2d. The documentation of nn.ConvTranspose2d has a detailed explanation of these parameters.

Moreover, channels are not always the 3rd dimension. For example, PyTorch functions generally assume channels are the dimension just before the height and width dimensions.

The output height and width are determined by the kernel size and stride, but the number of channels ("copies"/"layers"/"slices" of HxW) can be arbitrary, so we need it as an argument of the function. The number of output channels directly determines the number of kernels, as we need a separate kernel for every output channel to learn a certain "feature" from the previous layer's "picture".
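As a quick sanity check (with hypothetical channel counts, and the kernel_size=3, stride=2 defaults of make_gen_block), you can pick any number of output channels for a transposed convolution, while the output height and width follow from the kernel size and stride:

```python
import torch

# Hypothetical channel counts; kernel_size=3, stride=2 as in make_gen_block's defaults
layer = torch.nn.ConvTranspose2d(in_channels=64, out_channels=32, kernel_size=3, stride=2)
x = torch.randn(1, 64, 1, 1)  # a 1x1 "image" with 64 channels
out = layer(x)
print(out.shape)  # torch.Size([1, 32, 3, 3]) - channels chosen freely, H and W from kernel/stride
```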

Hi @Barb

As you have seen in the ConvTranspose2d operation, the height and width of the tensor are not passed directly. The reason is that the height and width of the output tensor are determined by other parameters such as padding, stride, and kernel shape.

The height and width of a tensor after a normal convolution operation are calculated using the formula given below (there is a different formula for ConvTranspose2d):

New_Tensor_Height = (Current_Tensor_Height + 2 * Padding - Kernel_Height) / Stride + 1
New_Tensor_Width = (Current_Tensor_Width + 2 * Padding - Kernel_Width) / Stride + 1
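For reference, the corresponding formula for nn.ConvTranspose2d (assuming dilation = 1 and output_padding = 0, their defaults) is:

New_Tensor_Height = (Current_Tensor_Height - 1) * Stride - 2 * Padding + Kernel_Height
New_Tensor_Width = (Current_Tensor_Width - 1) * Stride - 2 * Padding + Kernel_Width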

For example, you have an input tensor of shape (1, 1, 50, 128) and now let us say that you applied nn.Conv2d on top of it with the following parameters:

input_channels = 1
output_channels = 10
kernel_size = 3
padding = 2
stride = 2

Then the new height and width of the tensor will be calculated as follows:

New_Tensor_Height = (50 + 2 * 2 - 3) / 2 + 1 = 26.5 => 26 (dimensions can't be fractional, so the result is floored)
New_Tensor_Width = (128 + 2 * 2 - 3) / 2 + 1 = 65.5 => 65 (floored again)

So, the shape of the output tensor will be (1, 10, 26, 65). Run the following code to see our experiment in action.

import torch

input_tensor = torch.ones([1, 1, 50, 128])
conv_layer = torch.nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3, stride=2, padding=2)
output_tensor = conv_layer(input_tensor)
print(output_tensor.shape)  # prints: torch.Size([1, 10, 26, 65])

Now let us understand the channels. PyTorch's general tensor format is [N, C, H, W], i.e. [batch_size, number of channels, height, width]. There are no restrictions on the number of channels; you can create a tensor with as many channels as you want. For instance, a color image has 3 channels: R, G, and B. A grayscale image has only a single channel. You could create your own tensor with shape [1, 16, 50, 128], which has 16 channels.

So you only need to tell the conv layer the number of channels before and after the operation. In the previous example, you created conv_layer, which takes an input tensor that has only 1 channel; the convolution operation defines and uses 10 kernels (out_channels determines the number of kernels used for convolution) to convolve with the input and generate an output tensor of 10 channels.
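You can verify the "one kernel per output channel" point by inspecting the layer's weight tensor, whose shape for nn.Conv2d is (out_channels, in_channels, kernel_height, kernel_width):

```python
import torch

conv_layer = torch.nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3, stride=2, padding=2)
# 10 kernels, each spanning 1 input channel and a 3x3 spatial extent
print(conv_layer.weight.shape)  # torch.Size([10, 1, 3, 3])
```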

Dear all,

Thank you so much for your answers. I didn’t actually realize why width and height were not directly passed as parameters.

One thing I'm not sure about, regarding stride and kernel size for instance, is why they are referenced as kernel_size[0] or kernel_size[1]. I guess it is because you can pass a tuple as a parameter, but how does it work to have two different kernel sizes for one transposed convolution?

Hi @Barb

Passing tuples for kernel_size and stride, such as (3, 4) and (1, 2) respectively, does not mean that you are passing two different kernel sizes.

If you pass kernel size as an integer, for instance, 3, then a square kernel of dimensions (3, 3) is assumed by PyTorch. And, if you pass kernel_size as (3, 4) then it means that all your kernels will have a height of 3 and a width of 4.

The same is true for stride. When you pass an integer, for instance, 2, then all kernels will perform convolution with a stride of 2 in both the height and the width direction of the feature map. But if you are passing stride = (2, 1), then it means that kernels will use a stride 2 in the height dimension and 1 in the width dimension.

Note: Passing the tuple (3, 3) is the SAME as passing just the integer 3. This is true for both kernel_size and stride.
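A small sketch of a rectangular kernel with per-axis strides, using the (1, 1, 50, 128) input from earlier (shapes follow the convolution formula above, with padding = 0):

```python
import torch

x = torch.ones(1, 1, 50, 128)
# kernel is 3 high and 4 wide; stride 2 along height, 1 along width
conv = torch.nn.Conv2d(1, 1, kernel_size=(3, 4), stride=(2, 1))
# H: (50 - 3) // 2 + 1 = 24, W: (128 - 4) // 1 + 1 = 125
print(conv(x).shape)  # torch.Size([1, 1, 24, 125])
```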

Hi @28utkarsh,

Oh I see, I wasn’t sure if the convolution maths allowed for non square kernels.

There is one more thing I don't understand, about the shape of the noise vector.

We generate a batch of noises of shape (n_batch, z_dim). However, how is the noise vector of size 64 (in the current WGAN) interpreted in terms of height and width ?

Edit: By experimenting, it looks like the noise is actually a 1x1 over z_dim channels. Is that correct ?

Hi @Barb
Actually, (n_batch, z_dim) specifies the batch size and length (64) of the noise vector. In the forward pass, you will add the height and width dimensions to the noise tensor by unsqueezing the tensor. The forward pass of the Generator will call the forward method of the Generator class. Check the following screenshot of the forward method presented in the DCGAN Notebook.

Notice that you are calling the self.unsqueeze_noise method to add the height and width dimensions to the noise tensor. In the unsqueeze method, another view of the tensor is returned.

Note that the noise tensor in memory does not change when you call the view method. It will still be (n_samples, z_dim). The view acts as a pointer, or a facade, on top of the underlying tensor, which is reinterpreted with the new shape on demand. Also, changing any value through the view of the tensor also changes the value in the main tensor.
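A minimal illustration of a view sharing memory with the original tensor (hypothetical sizes n_samples = 4, z_dim = 64):

```python
import torch

noise = torch.randn(4, 64)              # (n_samples, z_dim)
unsqueezed = noise.view(4, 64, 1, 1)    # same storage, reinterpreted as (n, c, h, w)
unsqueezed[0, 0, 0, 0] = 5.0            # writing through the view...
print(noise[0, 0])                      # ...changes the original: tensor(5.)
```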

@28utkarsh Oh I see! I skipped the view part, because in WGAN it’s not a separate function.

Sorry @Barb , I didn’t see your edit as I was writing the answer.

If I have understood your question correctly, then inherently the shape over z_dim is 1x1, but that's not the right way to interpret it. You could just as well say that the shape over z_dim is 1x1x1x1…1; you could add as many 1s as you want. Our program doesn't distinguish between z_dim, height, and width: all of them are merely dimensions to the computer.

So, you should add explicit height and width dimensions to your tensor and call it [n_samples, z_dim, 1, 1]. The z_dim is z_dim ONLY.

I hope that my response answers your query.

@Barb , I guess that the notebook creators always present a new concept in a separate function the first time, to highlight it clearly. In the following notebooks, they fold the code of that separate function into the general code.

Hi @28utkarsh,

Thanks for these explanations. Considering the different videos etc., I think I'm still not sure what z_dim actually represents. I understand the concept of noise being an input fed to the generator. But then, what should the shape of such an input be? Why do we unsqueeze the noise into shape (z_dim, 1, 1) and not (z_dim, 2, 1), for instance?

Additionally, how do you usually choose kernel size, stride etc. to make your generator output something of a specific size ? Is it trial and error ?

Hi @Barb
Firstly, the noise is just a vector of z_dim (64) values drawn from a normal distribution. But you cannot perform a PyTorch 2D convolution operation on top of a vector, because the vector has only a single dimension; a 2D convolution expects a 4-dimensional tensor. And that's the reason behind unsqueezing (simply adding new size-1 dimensions to) your noise tensor.

Now, why into shape (n_samples, z_dim, 1, 1) and not (n_samples, z_dim, 2, 1)? I will try to explain this with a different example.

Try to imagine an images tensor that has got 4-dimensions with shape [4, 3, 8, 8].

The first dimension represents the batch size, which is just the number of images present in a batch. In my example, there are 4 images in the batch. The other 3 dimensions are channels, height, and width respectively. The product of all 4 dimensions is 4 x 3 x 8 x 8 = 768.

Now, you can reshape (not resize) this tensor to any shape as long as the product of the new dimensions (the total number of entries in the tensor) is still 768. For instance, you can reshape the tensor into shapes like [1, 12, 8, 8], [4, 1, 24, 8], [4, 3, 1, 64], [4, 192, 1, 1], and many more. Notice that, in all the new shapes, the product of the dimensions equals 768.

Therefore, you can't reshape to a shape whose product of dimensions differs from 768, for instance [4, 192, 2, 1] (highlighting this to explain your query): a different product means that you are either underestimating or overestimating the number of entries present in the tensor. Similarly, you can't reshape (n_samples, z_dim) to (n_samples, z_dim, 2, 1), because you would be overestimating the number of entries in the tensor.
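The entry-count rule can be checked directly; the hypothetical reshapes below succeed or fail depending only on whether the product of the dimensions stays 768:

```python
import torch

t = torch.ones(4, 3, 8, 8)             # 4 * 3 * 8 * 8 = 768 entries
print(t.reshape(4, 192, 1, 1).shape)   # fine: product is still 768
try:
    t.reshape(4, 192, 2, 1)            # product would be 1536, not 768
except RuntimeError as e:
    print("invalid reshape:", e)
```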

Secondly, kernel size and stride are hyperparameters of your layer, so you can adjust them as per your experiments, but kernel_size = 3 and stride = 1 or 2 are fairly standard choices. A larger kernel size means that you capture a larger context of the image tensor, but it increases the computation time considerably. Similarly, the stride represents the amount of information that you skip during convolution: a larger stride leads to a higher loss of information. All of this depends on the experimental setup.

@28utkarsh Thank you so much for the detailed explanation.
Do we obtain the same result no matter how we unsqueeze (as long as we respect the number of entries)?

Also, I get your point about the information captured by the kernel size and stride.

Again, thanks a lot for all these explanations, they’ve been really helpful.


Yeah. You are good as long as you are not explicitly changing the entries present in the tensor. You are only changing the way it looks, i.e. reshaping the tensor, because you want to make it compatible with a 2D convolution operation.