In the identity block, the first Conv2D uses “valid” padding. But “valid” padding normally reduces the height and width of the input. Why is “valid” chosen over “same” padding here?

```
# First component of main path
X = Conv2D(filters = F1, kernel_size = 1, strides = (1,1), padding = 'valid', kernel_initializer = initializer(seed=0))(X)
X = BatchNormalization(axis = 3)(X, training = training) # Default axis
X = Activation('relu')(X)
```

Thank you in advance.

Generally speaking, “valid” padding will reduce the height and width of an image, but notice that this is a special case: the filter size is 1 and the stride is 1, so this is a so-called “1 x 1” convolution. Apply the normal formula for computing the output size to this special case:

$$n_{out} = \left\lfloor \frac{n_{prev} + 2p - f}{s} \right\rfloor + 1$$

With p = 0, f = 1 and s = 1, you’ll see that n_{out} = n_{prev}. So in that special case of f = 1 and s = 1, “same” and “valid” padding are the same thing.
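To check the arithmetic, here is a minimal sketch of that formula in plain Python. The helper names `conv_output_size` and `same_padding` are made up for this illustration (they are not Keras functions), and the input size of 64 is an arbitrary example value.

```python
def conv_output_size(n_prev, f, s=1, p=0):
    """Output height or width of a convolution: floor((n_prev + 2p - f) / s) + 1."""
    return (n_prev + 2 * p - f) // s + 1


def same_padding(f):
    """Per-side padding that "same" needs at stride 1, assuming an odd filter size f."""
    return (f - 1) // 2


# 1 x 1 convolution, stride 1: "valid" (p = 0) already preserves the size,
# and "same" also needs p = (1 - 1) // 2 = 0, so the two modes coincide.
print(conv_output_size(64, f=1, s=1, p=0))                # 64
print(conv_output_size(64, f=1, s=1, p=same_padding(1)))  # 64

# 3 x 3 convolution, stride 1: here the two modes really differ.
print(conv_output_size(64, f=3, s=1, p=0))                # 62
print(conv_output_size(64, f=3, s=1, p=same_padding(3)))  # 64
```

So the choice of “valid” only matters once the filter is larger than 1 x 1.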

But note that they have planned out the transformations here so that all the shapes match up. The maybe more interesting question is why, in the convolutional block, there is one case in which they use a 1 x 1 convolution with a stride of 2. That literally means they are skipping every other position along both the height and the width, so most of the input positions are never read. Not sure why they do that: wouldn’t an average or max pooling layer achieve the same dimension reduction without discarding inputs outright? I don’t have an answer, but you can try looking at the residual net paper to see if they say more about this.
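To make the stride-2 case concrete, here is a small sketch in plain Python of which input positions a 1 x 1 “valid” convolution with stride 2 actually reads. The helper `sampled_positions` and the 4 x 4 input size are assumptions for illustration only; nothing here comes from the assignment code.

```python
def sampled_positions(n, stride):
    """Indices along one axis that a 1 x 1 "valid" convolution reads."""
    return list(range(0, n, stride))


n = 4  # hypothetical 4 x 4 input
rows = sampled_positions(n, stride=2)
cols = sampled_positions(n, stride=2)

# Every (row, col) pair the filter is applied to.
read = [(r, c) for r in rows for c in cols]
fraction_read = len(read) / (n * n)

print(read)           # [(0, 0), (0, 2), (2, 0), (2, 2)]
print(fraction_read)  # 0.25
```

With a stride of 2 in both directions, only one position in four contributes to the output. A 2 x 2 pooling layer would produce the same output size while looking at every position.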

Hi,

After finishing the exercises, I am trying to understand the code better. I am still figuring out why there is a 1x1 convolution in the first and third components. I also noticed in the test code that the number of filters differs between training ([3, 3, 3]) and inference ([4, 4, 3]). I have no idea why, as it isn’t mentioned in the lecture notes.

Welcome to the community.

> why there is 1x1 convolution on first and third component,

This is actually a very important point for this network. As you know, we are building a very deep neural network, and in that case one of the biggest challenges is the computational cost (time).

Since convolutions are expensive, we need a good mechanism to keep that cost down.

In the paper Deep Residual Learning for Image Recognition, the authors proposed the “bottleneck” design. As you can see, it has three convolutions.

- The role of the 1st convolution (1x1) is to reduce the depth. For example, if the input is (h,w,c), applying “n” filters gives an output of (h,w,n). So, if we select a small “n”, the depth (number of channels) is reduced.
- The role of the 2nd convolution is the “real” feature extraction, so it uses 3x3 filters. Because the depth has been reduced, its computational cost is lower.
- The role of the 3rd convolution (1x1) is to restore the depth. For example, if we apply “c” filters, the shape of the data is restored to (h,w,c).

So, this is one of the excellent techniques that ResNet proposed.
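As a rough sketch of the saving, we can count multiply-accumulate operations for the three bottleneck convolutions and compare them with a single 3x3 convolution applied directly at full depth. The shape (56, 56, 256) and n = 64 are hypothetical values chosen for illustration, `conv_macs` is a made-up helper, and bias terms and batch normalization are ignored.

```python
def conv_macs(h, w, c_in, c_out, f):
    """Multiply-accumulates for an f x f convolution producing an (h, w, c_out) output."""
    return h * w * c_in * c_out * f * f


h, w, c = 56, 56, 256  # hypothetical input shape (h, w, c)
n = 64                 # hypothetical reduced depth

bottleneck = (
    conv_macs(h, w, c, n, f=1)    # 1st: 1x1, reduce depth c -> n
    + conv_macs(h, w, n, n, f=3)  # 2nd: 3x3 feature extraction at depth n
    + conv_macs(h, w, n, c, f=1)  # 3rd: 1x1, restore depth n -> c
)
plain = conv_macs(h, w, c, c, f=3)  # one 3x3 convolution at full depth c

print(bottleneck)
print(plain)
print(plain / bottleneck)
```

Under these assumptions the full-depth 3x3 convolution costs more than eight times as much as the whole bottleneck, and the gap grows as n shrinks relative to c.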

> the #filters is different on training [3, 3, 3] vs. on inference[4, 4, 3], on the testing code,

I may not have caught your point. Assuming we are discussing identity_block for ResNet, the numbers of filters are hard-coded in ResNet50(), which you write yourself. So they do not change between training time and inference time. Could you elaborate on this? I may be missing something.

Hope this helps.

Thank you for your clear explanation!

I thought bottlenecks were only used in MobileNet…

The question about the number of filters is based on the test case code in the exercise:

*A3 = identity_block(X, f=2, filters=[4, 4, 3],*

*A4 = identity_block(X, f=2, filters=[3, 3, 3],*

I think these are just test cases covering different scenarios; they are not related to the *“training mode”*…

> I think it is just for test case purpose for different scenarios, it is not related to the *“training mode”* …

Yes, that’s right. Those exist only for testing (in “public_test.py”), to ensure that your identity_block can handle different inputs.

Also, the “training” parameter is for BatchNormalization, not for Conv2D. As you know, BatchNormalization changes its behavior depending on this flag.
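To illustrate what that flag changes, here is a deliberately simplified sketch in plain Python. `TinyBatchNorm` is a hypothetical single-feature stand-in, not the Keras layer: it has no learned scale or shift, and it only mimics the idea that training mode normalizes with the current batch statistics and updates moving averages, while inference mode normalizes with the stored moving averages. The momentum and epsilon values mirror the Keras defaults.

```python
import math


class TinyBatchNorm:
    """Simplified single-feature batch norm, for illustration only."""

    def __init__(self, momentum=0.99, eps=1e-3):
        self.momentum = momentum
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, batch, training):
        if training:
            # Training: use this batch's statistics and update the moving averages.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Inference: use the stored moving averages; nothing is updated.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]


bn = TinyBatchNorm()
batch = [10.0, 12.0, 14.0]

train_out = bn(batch, training=True)   # centered on the batch mean
infer_out = bn(batch, training=False)  # centered on the (barely updated) moving mean

print(train_out)
print(infer_out)
```

The same input gives different outputs in the two modes, which is why the tests call identity_block with both settings of the flag. A convolution has no such mode-dependent state.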