in the programming exercise of week 2 - ResNet, part 3.2 - The Convolutional Block, it states, quote
“For example, to reduce the activation dimensions’ height and width by a factor of 2, you can use a 1x1 convolution with a stride of 2.”
As far as I know, 1x1 convolutions only reduce the number of channels in the activation layers, right? It is the pooling layers that reduce the height and width of the input. Is this an error in the programming exercise?
Hi, @realnoob !
Yes, 1x1 convolutions reduce the number of channels, but any convolution with a stride different from 1 also reduces the spatial size of the output. In this case, a stride of 2 produces a feature map with half the height and width (a stride of 3 would give a third).
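Here is a minimal shape check, assuming TensorFlow/Keras as in the exercise (the input tensor size below is made up purely for illustration):

```python
# Minimal sketch, assuming TensorFlow/Keras as in the exercise.
# The input shape is chosen arbitrarily for illustration.
import tensorflow as tf

x = tf.random.normal((1, 64, 64, 256))  # (batch, height, width, channels)

# 1x1 convolution with stride 2: height and width are halved,
# and the number of channels changes to `filters`.
conv = tf.keras.layers.Conv2D(filters=128, kernel_size=1, strides=2)
print(conv(x).shape)  # (1, 32, 32, 128)
```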
Ahh, now I get it. Thank you so much, I forgot about that!
This is an interesting point. Notice that a 1 x 1 convolution with stride = 2 means that you are literally discarding (completely ignoring) half of the inputs along each spatial dimension. They could also have used a pooling layer with a stride of 2 and gotten the same dimensionality reduction without completely ignoring any of the inputs. This question has been asked multiple times before, but I don’t know why they made that design choice. The only argument I can think of is that the 1 x 1 convolution has a slightly lower compute cost for achieving that size reduction. Still, the idea of simply discarding inputs seems a bit counterintuitive.

Well, maybe you could consider a Max Pooling layer as also ignoring half the inputs: it does drop half of them, but it looks at the values to decide which ones to keep, which is not really equivalent to ignoring them completely. There is one other difference when using pooling layers: they operate “channelwise”, so the number of channels is preserved. With 1 x 1 convolutions, you can also change the number of channels at the same time.
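To make that last difference concrete, here is a hedged sketch in TensorFlow/Keras (shapes chosen arbitrarily) comparing the two downsampling options:

```python
# Comparison sketch, assuming TensorFlow/Keras; shapes are illustrative only.
import tensorflow as tf

x = tf.random.normal((1, 64, 64, 256))  # (batch, height, width, channels)

# Both layers halve the height and width, but pooling preserves the
# channel count while the 1x1 convolution can change it at the same time.
conv = tf.keras.layers.Conv2D(filters=512, kernel_size=1, strides=2)
pool = tf.keras.layers.MaxPooling2D(pool_size=2, strides=2)

print(conv(x).shape)  # (1, 32, 32, 512): channels changed
print(pool(x).shape)  # (1, 32, 32, 256): channels preserved
```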
If anyone has the energy, it would be worth reading the original Residual Net papers to see if they comment on this design choice.