Week 2 ResNet programming exercise: the use of one-by-one convolution

In the week 2 programming exercise (ResNet), part 3.2, The Convolutional Block, it states:

“For example, to reduce the activation dimensions’s height and width by a factor of 2, you can use a 1x1 convolution with a stride of 2.”

As far as I know, 1x1 convolutions only reduce the number of channels in the activations, right? It is the pooling layers that reduce the height and width of the input. Is this an error in the programming exercise?

Hi, @realnoob !

Yes, 1x1 convolutions reduce the number of channels, but any convolution with a stride other than 1 also reduces the output size. In this case, a stride of 2 will output a feature map that is half the height and width (with a stride of 3, roughly a third).
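To see both effects at once, here is a minimal pure-Python sketch of a 1x1 convolution on nested lists. The helper names `conv1x1` and `shape` and the toy weights are assumptions made for illustration; they are not from the exercise, which uses Keras layers.

```python
def conv1x1(x, w, stride=1):
    """Apply a 1x1 convolution (no padding, no bias).

    x: nested list of shape [height][width][c_in]
    w: nested list of shape [c_in][c_out]
    """
    c_out = len(w[0])
    out = []
    for i in range(0, len(x), stride):            # visit every stride-th row
        row = []
        for j in range(0, len(x[0]), stride):     # visit every stride-th column
            pixel = x[i][j]
            # Mix channels: each output channel is a weighted sum of the input channels.
            row.append([sum(pixel[c] * w[c][k] for c in range(len(pixel)))
                        for k in range(c_out)])
        out.append(row)
    return out


def shape(t):
    """Return (height, width, channels) of a nested-list tensor."""
    return (len(t), len(t[0]), len(t[0][0]))


# Toy 4x4 input with 3 channels, and weights mapping 3 channels to 2.
x = [[[float(i + j + c) for c in range(3)] for j in range(4)] for i in range(4)]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print(shape(conv1x1(x, w, stride=1)))  # (4, 4, 2): only the channel count changes
print(shape(conv1x1(x, w, stride=2)))  # (2, 2, 2): height and width are halved too
```

The channel reduction comes from the weights, while the spatial reduction comes purely from the stride.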

Ahh now I get it. Thank you so much, I forgot about that :sweat_smile:

This is an interesting point. Notice that a 1 x 1 convolution with stride = 2 means that you are literally discarding (completely ignoring) most of the inputs: it reads only every other row and every other column, so three out of every four input positions never contribute to the output. They could also have used a pooling layer with a stride of 2 and gotten the same dimensionality reduction without ignoring any of the inputs. This question has been asked multiple times before, but I don’t know why they made that design choice. The only argument I can think of is that the 1 x 1 convolution would cost a bit less compute to achieve that size reduction. But the idea of simply discarding inputs seems a bit counterintuitive.

Well, maybe you could consider a Max Pooling layer as dropping inputs too: a 2 x 2 max pool keeps only one value from each window, but it looks at the values to decide which one to keep. That’s not really equivalent to ignoring them completely. There is one other difference when using pooling layers: they operate “channelwise”, so the number of channels is preserved. With 1 x 1 convolutions, you can also change the number of channels at the same time.
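To make the "discarding inputs" point concrete, here is a small sketch that counts which input positions each operation actually reads on a 4 x 4 input. Skipping every other row and every other column means the 1 x 1 filter touches only one position in each 2 x 2 patch. The helper `positions_read` is hypothetical and only models window placement with no padding, not the arithmetic of either layer.

```python
def positions_read(n, f, s):
    """Return the set of (row, col) input positions touched by an
    f x f window slid over an n x n input with stride s (no padding)."""
    read = set()
    for top in range(0, n - f + 1, s):
        for left in range(0, n - f + 1, s):
            for di in range(f):
                for dj in range(f):
                    read.add((top + di, left + dj))
    return read


n = 4
conv_1x1_s2 = positions_read(n, f=1, s=2)   # 1x1 convolution, stride 2
pool_2x2_s2 = positions_read(n, f=2, s=2)   # 2x2 pooling, stride 2

print(len(conv_1x1_s2), "of", n * n)  # 4 of 16
print(len(pool_2x2_s2), "of", n * n)  # 16 of 16
```

Both produce a 2 x 2 output, but only the pooling window looks at every input value before deciding what to pass on.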

If anyone has the energy, it would be worth reading the original Residual Net papers to see if they comment on this design choice.

Hi @paulinpaloalto, I also feel that ignoring inputs in order to implement a stride of 2 may not be the best choice. However, focusing on the purpose of the 1x1 convolution with stride 2, which is to halve the input's height and width, I observed that the intuition works, but when I try to describe it with the formula ((n + 2p - f) / 2) + 1, it always returns 0.5 more, no matter the input shape. For instance, for a 28x28 input image, this layer should return a 14x14 output. However, the formula above gives me 14.5.

That is not the complete formula. Here it is in full mathematical notation:

n_{out} = \displaystyle \lfloor \frac {n_{in} + 2p - f}{s} \rfloor + 1

The key point that is missing from your version is the “floor” function, which is expressed with the \lfloor and \rfloor symbols. The floor function takes an input and returns the largest integer that is less than or equal to the input value. That is necessary because (as you experienced) as soon as the stride is > 1, you face the issue that the numerator may not be evenly divisible by the stride. In your example, (28 + 0 - 1) / 2 = 13.5, the floor of that is 13, and adding 1 gives the expected 14.
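As a quick check, the formula can be sketched in Python, where integer floor division (`//`) plays the role of the floor function. The function name `conv_output_size` is just an illustrative choice, and the sketch assumes the filter fits inside the padded input.

```python
def conv_output_size(n_in, f, p=0, s=1):
    """Output height or width of a convolution: floor((n_in + 2p - f) / s) + 1.

    n_in: input size, f: filter size, p: padding, s: stride.
    Assumes n_in + 2p >= f, so the numerator is non-negative.
    """
    return (n_in + 2 * p - f) // s + 1


print(conv_output_size(28, f=1, p=0, s=2))  # 14 (not 14.5)
print(conv_output_size(7, f=3, p=0, s=2))   # 3
print(conv_output_size(28, f=1, p=0, s=3))  # 10
```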

Professor Ng covered this point in DLS C4 W1 in the lecture titled Strided Convolutions. Watch starting at time offset 3:00 or a little before that to get the context.

Perfect. Thank you again