Course 4 Week 2: Residual Networks - kernel size = 1, stride = 2

From the original paper, we see that the authors use a stride of 2 at the beginning of each stage (the first block in each group of blocks) to downsample the feature maps (you can see the "/2" in the image). They also use bottleneck blocks for ResNet-50 and deeper:

> Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design.
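To make that concrete, here is a minimal Keras sketch of one bottleneck block that downsamples by 2. This is not the course's exact helper function; the layer order (conv → batch norm → ReLU) and the example filter sizes are just the common choices:

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, filters, stride=2):
    """Bottleneck residual block; stride=2 halves the spatial size (the '/2')."""
    f1, f2, f3 = filters
    shortcut = x

    # 1x1 conv reduces the channel count; putting stride 2 here does the downsampling
    out = layers.Conv2D(f1, 1, strides=stride)(x)
    out = layers.BatchNormalization()(out)
    out = layers.ReLU()(out)

    # 3x3 conv operates on the cheaper, reduced representation
    out = layers.Conv2D(f2, 3, padding="same")(out)
    out = layers.BatchNormalization()(out)
    out = layers.ReLU()(out)

    # 1x1 conv restores the channel dimension
    out = layers.Conv2D(f3, 1)(out)
    out = layers.BatchNormalization()(out)

    # projection shortcut: a 1x1 conv with the same stride so the shapes match
    shortcut = layers.Conv2D(f3, 1, strides=stride)(shortcut)
    shortcut = layers.BatchNormalization()(shortcut)

    out = layers.Add()([out, shortcut])
    return layers.ReLU()(out)

# e.g. a 56x56x256 feature map becomes 28x28x512 after one downsampling block
inputs = tf.keras.Input(shape=(56, 56, 256))
outputs = bottleneck_block(inputs, (128, 128, 512), stride=2)
```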

I recommend you read the paper and pay attention to small details.

The title of their paper is "Deep Residual Learning for Image Recognition". Very deep networks are computationally expensive to train, and the authors found that downsampling (together with the residual connections) makes it computationally feasible to train such deep networks, which was the aim of their paper.

Prof Andrew Ng has a video about this tradeoff dilemma: given a computational budget, we can change the network architecture in different ways.

Now, should you use a stride of 2 on the 1x1 convolutions, or on the 3x3 convolutions as they show in the picture?

Different libraries have tried both ways: the original implementation puts the stride on the first 1x1 convolution, while some later variants (for example torchvision's "ResNet v1.5") move it to the 3x3 convolution.
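To make the two placements concrete, here is a minimal sketch of just the convolution stack (batch norm and ReLU omitted for brevity; the function names are mine, for illustration):

```python
from tensorflow.keras import layers

def stride_on_1x1(x, f1, f2, f3):
    # original placement: stride 2 on the first 1x1 convolution
    x = layers.Conv2D(f1, 1, strides=2)(x)
    x = layers.Conv2D(f2, 3, padding="same")(x)
    return layers.Conv2D(f3, 1)(x)

def stride_on_3x3(x, f1, f2, f3):
    # alternative placement: stride 2 on the 3x3 convolution instead
    x = layers.Conv2D(f1, 1)(x)
    x = layers.Conv2D(f2, 3, strides=2, padding="same")(x)
    return layers.Conv2D(f3, 1)(x)
```

Striding the 1x1 discards three out of every four spatial positions before the 3x3 convolution ever sees them; striding the 3x3 lets it look at the full-resolution input first, at the cost of a bit more computation. Both produce the same output shape.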

I hope my comments help clear things up a little :smiley:

A bonus answer: why don't we change the stride-2 1x1 convolutions in the shortcut to something like max pooling, to preserve more information? Because we want the shortcut to stay a linear function, and max pooling is non-linear. We want it linear so that gradients can flow through the shortcut without being interrupted by a non-linearity; that is the whole purpose of residual connections.

2nd bonus comment: We could use average pooling instead of max pooling, because average pooling is a linear operation.
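Here is a small sketch of the two shortcut options (Keras-style, illustrative names only; the average-pooling version is the bonus idea above, not what the original paper uses):

```python
from tensorflow.keras import layers

def conv_shortcut(x, out_channels):
    # what the paper uses: a strided 1x1 convolution (a linear map)
    return layers.Conv2D(out_channels, 1, strides=2)(x)

def avgpool_shortcut(x, out_channels):
    # the bonus idea: average pooling (also linear), then a 1x1 conv to match channels
    x = layers.AveragePooling2D(pool_size=2, strides=2)(x)
    return layers.Conv2D(out_channels, 1)(x)
```

Some later ResNet variants do in fact use an average-pooling shortcut along these lines.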
