C3_M1_Lab_1_siamese_network - Suspected bug/typo in the code

Hello,

Could you please confirm that the calculation of the input size of the fully connected linear layer in the SimpleEmbeddingNetwork class of the module's first lab is correct (see the attached screenshot)?

With a 224x224 input image and three nn.MaxPool2d(2, stride=2) modules, it seems to me the size should be 128 * 28 * 28 (since 224/2/2/2 = 28, not 25).

Thank you!

Notice that the Conv2d layers there do not specify “same” padding. So you can’t assume that the output is the same size, right? The formula is:

n_{out} = \displaystyle \lfloor \frac {n_{in} + 2p - f}{s} \rfloor + 1

where p is the padding (0 by default), s is the stride (1 by default), and f is the filter (kernel) size.
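That formula is small enough to sketch as a pure-Python helper (the name `conv_out` is just illustrative, not from the lab):

```python
def conv_out(n_in, f, p=0, s=1):
    """Spatial output size of a conv (or pool) layer:
    floor((n_in + 2*p - f) / s) + 1."""
    return (n_in + 2 * p - f) // s + 1

# e.g. a 3x3 kernel with no padding shrinks a 28x28 input:
print(conv_out(28, f=3))  # 26
```

Integer division `//` matches the floor here because all the quantities are positive.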

You have two layers each with f = 5, followed by one with f = 3 with MaxPool layers after each. What does that give you as the output shape?


The higher level point is that the code was just given to you. Did you try running it? If your theory is correct, then it should explode and catch fire, but I’ll bet that it doesn’t. It’s worth working through the implications of what I said in the previous reply.

Just to work through the first conv layer:

n_{out} = 224 + 2 * 0 - 5 + 1 = 220

Then the MaxPool layer will give you 110 as the output dimension of the layer. Not the same as \frac {224}{2}, right?

The rest is left as an exercise to the reader. :nerd_face:


If I calculate layer-wise:

First Convolution and Maxpooling

The input is an RGB image of 224 x 224 x 3

Conv2d layer 1 - Using 32 filters with a 5×5 kernel (stride 1, no padding), the output spatial dimension is (224 - 5) + 1 = 220.

Output = 220 x 220 x 32 feature map

MaxPool2d layer 1 - with a 2x2 kernel and stride 2, the dimensions are halved

Output 220/2 = 110

Output shape is 110 x 110 x 32 feature map.

Second Convolution and MaxPooling

Conv2d layer 2 - Using 64 filters with a 5×5 kernel (stride 1, no padding), the output spatial dimension is (110 - 5) + 1 = 106

Output = 106 x 106 x 64 feature map

MaxPool2d layer 2 - with a 2x2 kernel and stride 2, the dimensions are halved

Output 106/2 = 53

Output shape is 53 x 53 x 64 feature map.

Third Convolution and MaxPooling

Conv2d layer 3 - Using 128 filters with a 3x3 kernel (stride 1, no padding), the output spatial dimension is (53 - 3) + 1 = 51

Output = 51 x 51 x 128 feature map

MaxPool2d layer 3 - with a 2x2 kernel and stride 2, the dimensions are halved (rounded down)

Output floor(51/2) = 25

Output shape is 25 x 25 x 128 feature map.

So the input dimension for the first fully connected layer is the product of the final feature map's depth, height, and width:

128 x 25 x 25
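The layer-wise walkthrough above can be double-checked with a short script (a sketch assuming the lab's layer order of stride-1, zero-padding convs each followed by a 2x2 stride-2 max pool; `conv_out` is an illustrative helper):

```python
def conv_out(n_in, f, p=0, s=1):
    # floor((n_in + 2*p - f) / s) + 1
    return (n_in + 2 * p - f) // s + 1

n = 224
for f in (5, 5, 3):           # kernel sizes of the three conv layers
    n = conv_out(n, f)        # conv: stride 1, no padding
    n = conv_out(n, 2, s=2)   # 2x2 max pool, stride 2
    print(n)                  # 110, then 53, then 25

print(128 * n * n)            # flattened size: 128 * 25 * 25 = 80000
```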


Another high level point here is that you really can’t start your ML/DL learning by taking this PyTorch specialization. You need to take DLS first, so that you understand the different types of networks (Fully Connected, ConvNets, RNNs and so forth), the fundamentals of how they work and what they are used for. The PyTorch series assumes you already know the fundamentals and then proceeds to show you how to build those things in torch. In DLS C4 (ConvNets), it shows many examples of conv layers which reduce the dimensions even before the pooling layers.


Thank you! Yes, I missed the different kernel sizes in the layers, sorry (inertia from the previous courses/modules, where the kernel size was always 3).

The second false alarm in a row on my part (after a few valid ones). It seems I need to start being more careful and thoroughly research the topics I ask about. And I will do so.

Thank you again.


Yes, we have seen lots of instances of conv layers that preserved the dimensions up to this point, but if you go back and check they all include non-zero padding. Without padding, conv layers reduce the output size. Even with padding they can also reduce the size especially if the stride is > 1. Here’s a thread about the surprising behavior of padding=same in TF anyway.


Just to be sure we are clear here, the kernel_size = 5 is not what triggers the reduction in size. It is the fact that p = 0 for all the layers in this case. The amount of padding you need to get the same output size varies depending on the kernel size. If we make the simplifying assumption that s = 1, then it’s easy to solve for p:

n_{in} = n_{in} + 2p -f + 1
2p = f - 1
p = \displaystyle \frac {f - 1}{2}

So if f = 3, then we need p = 1.

If f = 5, then p = 2.

And so forth …

Note that if f = 3 and p = 0 you also get a reduction in size.
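As a quick check of the p = (f - 1)/2 rule (again with an illustrative `conv_out` helper, assuming s = 1 and an odd kernel size):

```python
def conv_out(n_in, f, p=0, s=1):
    # floor((n_in + 2*p - f) / s) + 1
    return (n_in + 2 * p - f) // s + 1

def same_padding(f):
    # padding that preserves spatial size for stride 1 and odd kernel f
    return (f - 1) // 2

for f in (3, 5, 7):
    print(f, same_padding(f), conv_out(224, f, p=same_padding(f)))
    # with p = (f - 1)/2 the output stays 224 for every odd f
```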


Yes, thank you. I understand how it works and what these parameters mean and how they relate to each other. I mentioned kernel sizes because their difference appeared explicitly in the code, but I still missed it, sorry.