Question about W2 residual network programming assignment

Hi, I'm working on W2's programming assignment right now. My question is: why do we use "valid" padding on the first and third components, but "same" padding on the second component? Thanks!


This is a good question for understanding a couple of important things, namely:

  1. How does ResNet reduce the computational cost of building a deep network?
  2. What additional considerations apply to the shortcut design?

Let's start with an overview of the identity block. Since we focus on the shape of the features, I removed BatchNormalization and Activation from the figure.

As you can see, there are 3 convolutional layers, but each layer plays a different role.
The main convolutional layer, with a (3x3) kernel (filter), sits in the middle. Its job is to extract features, and that is the main task.
The challenge is that stacking many such convolutions in a deep network incurs a huge computational cost (and time). So, the authors add additional "light" convolutional layers before and after the main convolution: the 1x1 convolutional layers.

  • The role of the 1st 1x1 convolutional layer is to reduce the depth with a small number of filters, 64. As a result, the channel size is reduced from 256 to 64, which significantly lowers the computational cost of the next layer.
  • The role of the last 1x1 convolutional layer is to restore the depth so the output can be merged with the shortcut, which has 256 channels. So, by using a 1x1 convolution with 256 filters, the authors restore the depth.
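To see why the bottleneck matters, here is a rough sketch that counts multiply-accumulate operations (MACs) for a single 3x3 convolution applied directly on 256 channels versus the 1x1 → 3x3 → 1x1 bottleneck. The 56x56 feature-map size is an assumption for illustration; only the ratio matters.

```python
def conv_macs(h, w, k, c_in, c_out):
    """MACs for a k x k convolution at stride 1 with 'same' spatial output."""
    return h * w * k * k * c_in * c_out

H = W = 56  # assumed feature-map size, for illustration only

# Plain 3x3 convolution operating directly on 256 channels
direct = conv_macs(H, W, 3, 256, 256)

# Bottleneck: 1x1 reduce (256 -> 64), 3x3 on 64 channels, 1x1 restore (64 -> 256)
bottleneck = (conv_macs(H, W, 1, 256, 64)
              + conv_macs(H, W, 3, 64, 64)
              + conv_macs(H, W, 1, 64, 256))

print(direct, bottleneck, round(direct / bottleneck, 1))  # ratio is roughly 8.5x
```

So the two "light" 1x1 layers pay for themselves many times over by shrinking the channel count that the expensive 3x3 layer sees.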

Now, let's think about "padding", which is your question.

  • For the 1st and 3rd convolutional layers, which are 1x1 convolutions, no padding is needed. A 1x1 convolution with stride=1 captures every element on its own, and the output already has the same height and width as the input. So, "valid" is the right choice.
  • For the 2nd convolutional layer, the important thing is the merge with the shortcut, so we want to keep the spatial dimensions unchanged. Without padding, a 3x3 kernel would shrink the height and width. So, using "padding=same" is quite reasonable.
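The two bullets above follow directly from the standard convolution output-size formula, out = floor((n + 2p - f) / s) + 1. A minimal sketch (the 56x56 input size is just an assumed example):

```python
def conv_out(n, f, p, s):
    """Output spatial size for input size n, kernel f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

n = 56  # assumed input height/width, for illustration

# 1x1 convolution, "valid" (p=0), stride 1: size is preserved with no padding
print(conv_out(n, f=1, p=0, s=1))  # 56

# 3x3 convolution, "valid" (p=0), stride 1: size shrinks by 2
print(conv_out(n, f=3, p=0, s=1))  # 54

# 3x3 convolution, "same" (p=1), stride 1: size is preserved
print(conv_out(n, f=3, p=1, s=1))  # 56
```

That is why "valid" costs nothing for the 1x1 layers, while the 3x3 layer needs "same" to keep its output mergeable with the shortcut.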

The above concept applies to the convolutional_block as well. One difference is that there is a convolutional step in the shortcut, so we need to be more careful that the dimensions of both paths match when they are merged at the end. The convolutional block also lets you change the stride for the first 1x1 convolution and for the shortcut convolution; as you can see, the same stride is used for both, which keeps the two output shapes identical. And "padding=valid" is still correct for the 1st convolution even with a larger stride. Again, it is a 1x1 convolution, i.e., it picks a single element. If a position falls outside the input because of the larger stride, we do not need any padding, since there is no data there. (In the case of an "n"x"n" convolution, there may be data at some of those positions, though…)
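As a sanity check on the "same stride on both paths" point, here is a sketch that traces the spatial size through an assumed main path (1x1 valid at stride s, then 3x3 same at stride 1, then 1x1 valid at stride 1) and through the shortcut (1x1 valid at stride s), using the standard output-size formula:

```python
def conv_out(n, f, p, s):
    """Output spatial size for input size n, kernel f, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

def main_path(n, stride):
    n = conv_out(n, f=1, p=0, s=stride)  # 1st 1x1, "valid", stride s
    n = conv_out(n, f=3, p=1, s=1)       # 3x3, "same", stride 1
    n = conv_out(n, f=1, p=0, s=1)       # last 1x1, "valid", stride 1
    return n

def shortcut(n, stride):
    return conv_out(n, f=1, p=0, s=stride)  # shortcut 1x1, "valid", stride s

# With the same stride on both paths, the sizes match and the Add works
print(main_path(56, 2), shortcut(56, 2))  # 28 28
```

If the two strides differed, the Add at the end would fail with a shape mismatch, which is exactly why the block passes the same stride to both paths.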

Hope this helps.


That’s well explained! Thanks so much for your help!