C3_M1_Lab_1_siamese_network - Suspected bug/typo in the code

Hello,

Could you please confirm that the calculation of the input size of the fully connected linear layer in the SimpleEmbeddingNetwork class of the module's first lab is correct (see the attached screenshot)?

With a 224x224 input image and three nn.MaxPool2d(2, stride=2) modules, it seems to me the size should be 128 * 28 * 28 (since 224/2/2/2 = 28, not 25).

Thank you!

Notice that the Conv2d layers there do not specify “same” padding. So you can’t assume that the output is the same size, right? The formula is:

n_{out} = \displaystyle \lfloor \frac {n_{in} + 2p - f}{s} \rfloor + 1

where p is the padding (0 by default), s is the stride (1 by default), and f is the filter (kernel) size.
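That formula is small enough to sketch as a pure-Python helper (the name `conv_out` is just illustrative, not from the lab):

```python
def conv_out(n_in, f, p=0, s=1):
    """Spatial output size of a conv (or pool) layer:
    floor((n_in + 2*p - f) / s) + 1."""
    return (n_in + 2 * p - f) // s + 1

# e.g. a 3x3 kernel with no padding shrinks a 28x28 input:
print(conv_out(28, f=3))  # 26
```

Integer division `//` matches the floor here because all the quantities are positive.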

You have two layers each with f = 5, followed by one with f = 3 with MaxPool layers after each. What does that give you as the output shape?


The higher level point is that the code was just given to you. Did you try running it? If your theory is correct, then it should explode and catch fire, but I’ll bet that it doesn’t. It’s worth working through the implications of what I said in the previous reply.

Just to work through the first conv layer:

n_{out} = 224 + 2 * 0 - 5 + 1 = 220

Then the MaxPool layer will give you 110 as the output dimension of the layer. Not the same as \frac {224}{2}, right?

The rest is left as an exercise to the reader. :nerd_face:


If I calculate layer-wise:

First Convolution and Maxpooling

The input is an RGB image of 224 x 224 x 3

Conv2d layer 1 - Using 32 filters with a 5×5 kernel (stride 1, no padding), the output spatial dimension is (224 - 5) + 1 = 220.

Output = 220 x 220 x 32 feature map

MaxPool2d layer 1 - with a 2x2 kernel and stride 2, the dimensions are halved

Output 220/2 = 110

Output shape is 110 x 110 x 32 feature map.

Second Convolution and MaxPooling

Conv2d layer 2 - Using 64 filters with a 5×5 kernel (stride 1, no padding), the output spatial dimension is (110 - 5) + 1 = 106

Output = 106 x 106 x 64 feature map

MaxPool2d layer 2 - with a 2x2 kernel and stride 2, the dimensions are halved

Output 106/2 = 53

Output shape is 53 x 53 x 64 feature map.

Third Convolution and MaxPooling

Conv2d layer 3 - Using 128 filters with a 3x3 kernel (stride 1, no padding), the output spatial dimension is (53 - 3) + 1 = 51

Output = 51 x 51 x 128 feature map

MaxPool2d layer 3 - with a 2x2 kernel and stride 2, the dimensions are halved (rounded down)

Output floor(51/2) = 25

Output shape is 25 x 25 x 128 feature map.

So the input dimension for the first fully connected layer is the product of the final feature map's depth, height, and width:

128 x 25 x 25
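The layer-wise walkthrough above can be double-checked with a short script (a sketch assuming the lab's layer order of stride-1, zero-padding convs each followed by a 2x2 stride-2 max pool; `conv_out` is an illustrative helper):

```python
def conv_out(n_in, f, p=0, s=1):
    # floor((n_in + 2*p - f) / s) + 1
    return (n_in + 2 * p - f) // s + 1

n = 224
for f in (5, 5, 3):           # kernel sizes of the three conv layers
    n = conv_out(n, f)        # conv: stride 1, no padding
    n = conv_out(n, 2, s=2)   # 2x2 max pool, stride 2
    print(n)                  # 110, then 53, then 25

print(128 * n * n)            # flattened size: 128 * 25 * 25 = 80000
```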


Another high level point here is that you really can’t start your ML/DL learning by taking this PyTorch specialization. You need to take DLS first, so that you understand the different types of networks (Fully Connected, ConvNets, RNNs and so forth), the fundamentals of how they work and what they are used for. The PyTorch series assumes you already know the fundamentals and then proceeds to show you how to build those things in torch. In DLS C4 (ConvNets), it shows many examples of conv layers which reduce the dimensions even before the pooling layers.


Thank you! Yes, I missed the different kernel sizes in the layers, sorry (inertia from the previous courses/modules, where the kernel size was always 3).

The second false alarm in a row on my part (after a few valid ones). It seems I need to start being more careful and thoroughly research the topics I ask about. And I will do so.

Thank you again.


Yes, we have seen lots of instances of conv layers that preserved the dimensions up to this point, but if you go back and check they all include non-zero padding. Without padding, conv layers reduce the output size. Even with padding they can also reduce the size especially if the stride is > 1. Here’s a thread about the surprising behavior of padding=same in TF anyway.


Just to be sure we are clear here, the kernel_size = 5 is not what triggers the reduction in size. It is the fact that p = 0 for all the layers in this case. The amount of padding you need to get the same output size varies depending on the kernel size. If we make the simplifying assumption that s = 1, then it’s easy to solve for p:

n_{in} = n_{in} + 2p -f + 1
2p = f - 1
p = \displaystyle \frac {f - 1}{2}

So if f = 3, then we need p = 1.

If f = 5, then p = 2.

And so forth …

Note that if f = 3 and p = 0 you also get a reduction in size.
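As a quick check of the p = (f - 1)/2 rule (again with an illustrative `conv_out` helper, assuming s = 1 and an odd kernel size):

```python
def conv_out(n_in, f, p=0, s=1):
    # floor((n_in + 2*p - f) / s) + 1
    return (n_in + 2 * p - f) // s + 1

def same_padding(f):
    # padding that preserves spatial size for stride 1 and odd kernel f
    return (f - 1) // 2

for f in (3, 5, 7):
    print(f, same_padding(f), conv_out(224, f, p=same_padding(f)))
    # with p = (f - 1)/2 the output stays 224 for every odd f
```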


Yes, thank you. I understand how it works and what these parameters mean and how they relate to each other. I mentioned kernel sizes because their difference appeared explicitly in the code, but I still missed it, sorry.