Course 4 Week 2: Residual Networks - kernel size = 1, stride = 2

In the residual network, for the first filter of the convolutional block we use kernel size = (1,1) and stride = (s,s), where s is usually 2.
Doesn’t that mean that we just throw away 75% of the input information? That doesn’t seem to make sense to me.
What I would expect instead: the first and third filters to have kernel size (1,1) and stride (1,1), and the second filter to have kernel size (f,f) and stride (s,s), so that we don’t throw away information.
Could someone help me understand?

We’re not really throwing away information. We’re extracting higher-order relationships from the raw data. That’s what a CNN does.

Yes, but please look at this specific example: the stride is larger than the filter size. The filter is 1 wide and 1 high, but the stride is 2. So we first look at all channels at position (0,0) and compute the convolution, then at all channels at position (0,2), which is the second convolution, and so on; then at (2,0), (2,2), (2,4), etc.
We never look at positions (0,1), (1,0), or (1,1), because of the stride. Those values are not used in our computations at all; no filter ever includes them.
In the examples in the lectures we had, for example, filter size 3 and stride 2, in which case this makes sense, because the stride is smaller than the filter. But when the stride is bigger than the filter, we skip positions entirely and never use them.
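To illustrate what I mean, here is a small sketch (assuming PyTorch; the sizes are just for demonstration) showing that a value at one of the skipped positions never influences the output of a 1x1, stride-2 convolution:

```python
import torch
import torch.nn as nn

# 1x1 convolution with stride 2: it only reads the input at even (row, col)
# positions, i.e. 25% of the pixels.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=1, stride=2, bias=False)

x = torch.zeros(1, 1, 4, 4)
y_before = conv(x)

# Perturb a position that the stride skips, e.g. (0, 1): the output is unchanged.
x[0, 0, 0, 1] = 100.0
y_after = conv(x)
print(torch.equal(y_before, y_after))  # True: position (0, 1) is never read

# Perturb a position the stride does visit, e.g. (0, 2): the output changes.
x[0, 0, 0, 2] = 100.0
print(torch.equal(y_after, conv(x)))   # False: position (0, 2) feeds an output pixel
```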


From the original paper, we see that the authors use a stride of 2 in the first block of each stage to downsample the information (the “/2” in the architecture figure). They also use bottleneck blocks for ResNet-50 and deeper:

Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design.
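For concreteness, a bottleneck block that downsamples looks roughly like this (a minimal sketch assuming PyTorch; the channel sizes are illustrative, not the paper’s exact configuration):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block with downsampling (illustrative sizes)."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=2):
        super().__init__()
        # 1x1 "reduce" conv: in the original paper, stride=2 is placed here
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1, stride=stride, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_ch)
        # 3x3 conv at the reduced channel width
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_ch)
        # 1x1 "expand" conv back to the block's output width
        self.conv3 = nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        # projection shortcut: strided 1x1 conv so the shapes match for the add
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + self.shortcut(x))

block = Bottleneck(256, 128, 512)   # e.g. the first (downsampling) block of a stage
x = torch.randn(1, 256, 56, 56)
print(block(x).shape)               # torch.Size([1, 512, 28, 28])
```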

I recommend you read the paper and pay attention to small details.

The title of their paper is Deep Residual Learning for Image Recognition. Deep networks are computationally expensive to train, and they found that downsampling makes very deep networks computationally feasible, which was the aim of their paper (together with the residual connections).

Prof Andrew Ng has a video about the tradeoff dilemma, where we can change the network architecture in different ways given a computational budget.

Now, should you use a stride of 2 on the 1x1 convolutions or the 3x3 convolutions as they show in the picture?

Different libraries have tried both ways.
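Roughly, the two placements look like this (again just a sketch assuming PyTorch, with illustrative channel sizes, not any particular library’s actual code):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 56, 56)

# Variant "v1" (original paper): stride 2 in the first 1x1 conv
v1 = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=1, stride=2, bias=False),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1, bias=False),
    nn.Conv2d(128, 512, kernel_size=1, bias=False),
)

# Variant "v1.5" (e.g. torchvision): stride 2 moved to the 3x3 conv
v15 = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=1, stride=1, bias=False),
    nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=1, bias=False),
    nn.Conv2d(128, 512, kernel_size=1, bias=False),
)

print(v1(x).shape, v15(x).shape)  # both torch.Size([1, 512, 28, 28])
```

Both variants produce feature maps of the same shape; they only differ in where inside the block the spatial resolution is halved.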

I hope my comments help clear things up a little :smiley:

A bonus answer: why don’t we replace the stride-2 1x1 convolution in the shortcut with something like max pooling, to preserve information? Because we want the shortcut to be a linear function, and max pooling is non-linear. Linear, because we want our gradients to flow nicely without being interrupted by non-linearities; that is the whole purpose of residual connections.

2nd bonus comment: We could use average pooling instead of max pooling, because average pooling is a linear operation.
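To make the two linear shortcut options concrete, here is a small sketch (assuming PyTorch; the channel sizes are just illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 56, 56)

# Projection shortcut used in the paper: a strided 1x1 conv (linear)
proj_shortcut = nn.Conv2d(256, 512, kernel_size=1, stride=2, bias=False)

# A hypothetical alternative that reads every spatial position: average-pool
# (also linear) to downsample, then a 1x1 conv to match the channel count
avgpool_shortcut = nn.Sequential(
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(256, 512, kernel_size=1, bias=False),
)

print(proj_shortcut(x).shape)     # torch.Size([1, 512, 28, 28])
print(avgpool_shortcut(x).shape)  # torch.Size([1, 512, 28, 28])
# Max pooling would also match the shapes, but it is non-linear, which defeats
# the purpose of an identity-like (linear) shortcut for gradient flow.
```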


Thank you for the detailed answer, yes it did help :smiley: I’ll just write down my conclusions, in case someone is as confused as I was :slight_smile:
My understanding now is that in the layers where we perform the downsampling, the (1x1 filter, stride = 2) layer does indeed use only 25% of its input. But we don’t throw information away, as I previously thought (only some computations of the previous layer are wasted), because:

  1. a similar result could be achieved by giving the previous 3x3 layer a stride of 2 instead of 1 – which obviously would not be throwing away information (see the sketch below);
  2. when an input is never used, the corresponding neuron is not trained by backpropagation (because that input doesn’t affect the loss function), so it doesn’t carry any useful information anyway.
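Here is a quick numerical check of point 1 (a sketch assuming PyTorch, with made-up sizes): with shared weights, “3x3 stride 1 followed by 1x1 stride 2” gives exactly the same result as “3x3 stride 2 followed by 1x1 stride 1”; the strided 1x1 conv just discards the 3x3 outputs that were computed for nothing.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 8, 16, 16)
w3 = torch.randn(8, 8, 3, 3)   # shared 3x3 weights
w1 = torch.randn(8, 8, 1, 1)   # shared 1x1 weights

# Option A: stride 2 in the 1x1 conv (75% of the 3x3 outputs go unused)
a = F.conv2d(F.conv2d(x, w3, padding=1, stride=1), w1, stride=2)

# Option B: stride 2 moved into the 3x3 conv (same result, fewer computations)
b = F.conv2d(F.conv2d(x, w3, padding=1, stride=2), w1, stride=1)

print(torch.allclose(a, b, atol=1e-5))  # True
```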

I discovered that someone on reddit had the exact same question and emailed the authors, and actually got a response:

Kaiming He was nice enough to share, after I emailed him: "In all experiments in the paper, the stride=2 operation is in the first 1x1 conv layer when downsampling. This might not be the best choice, as it wastes some computations of the preceding block. For example, using stride=2 in the first 1x1 conv in the first block of conv3 is equivalent to using stride=2 in the 3x3 conv in the last block of conv2. So I feel applying stride=2 to either the first 1x1 or the 3x3 conv should work. I just kept it “as is”, because we do not have enough resources to investigate every choice."


And the folks at Nvidia have since found that it is actually better to put the stride=2 in the 3x3 conv rather than the 1x1 conv:

Bottleneck in torchvision places the stride for downsampling at the 3x3 convolution (self.conv2), while the original implementation places the stride at the first 1x1 convolution (self.conv1), according to "Deep Residual Learning for Image Recognition" ([1512.03385] Deep Residual Learning for Image Recognition). This variant is also known as ResNet v1.5 and improves accuracy, according to ResNet v1.5 for PyTorch | NVIDIA NGC.
