Logic bug: convolutional_block() ignores a large fraction of its input?

In the assignment “Residual_Networks”:

For both the paths in convolutional_block(), i.e. both the main path and the shortcut path, the first step is to apply a 1x1 convolution with a stride of s (where s often equals 2).

Doesn’t this mean that a large fraction of the input to the block (e.g., the activations from the previous neural-network layer) is ignored (because the 1x1 convolution with stride > 1 is effectively subsampling the input)?

If so, isn’t much of the computation (and many of the weights) in the previous layer wasted?

Has anyone looked at this?

I now realize that no weights go unused in the previous layer: a convolution applies the same weights at every position, so they contribute to both the used and the unused output activations of that layer.

However, it still seems inefficient to compute many output values in the previous layer and then ignore ~75% of them when they are fed into convolutional_block(s=2), if I understand correctly.
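To sanity-check the ~75% figure, here is a small counting sketch in plain Python. It is not code from the assignment: the function name `fraction_of_inputs_read` and the 56 x 56 feature-map size are made up for illustration, and it assumes “valid” padding with a square window.

```python
def fraction_of_inputs_read(height, width, kernel=1, stride=2):
    """Fraction of spatial positions touched by a 'valid' sliding window.

    Hypothetical helper for illustration; it only counts positions and
    does not depend on the number of channels or on the filter values.
    """
    rows = set()
    for top in range(0, height - kernel + 1, stride):
        rows.update(range(top, top + kernel))
    cols = set()
    for left in range(0, width - kernel + 1, stride):
        cols.update(range(left, left + kernel))
    return len(rows) * len(cols) / (height * width)


# 1x1 window, stride 2: only every other row and column is read.
print(fraction_of_inputs_read(56, 56, kernel=1, stride=2))  # 0.25
# 2x2 window, stride 2 (the usual pooling setup): every position is read.
print(fraction_of_inputs_read(56, 56, kernel=2, stride=2))  # 1.0
```

So with a 1x1 kernel and s=2, a quarter of the positions are read and the remaining three quarters never reach the block, which matches the estimate above.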

I agree with both of your conclusions:

  1. The weights are not “wasted” since that’s how convolution layers work: the filter coefficients are applied everywhere.
  2. It is the case that a 1 x 1 convolution with stride > 1 just discards a large fraction of the outputs of the previous layer.

Normally in the ConvNet architectures we’ve seen up to this point, the downsizing of the height and width dimensions has been done by a combination of “valid” padding and pooling layers. At least a pooling layer reads all of its inputs, although with max pooling only the largest value in each window affects the output. So this is effectively a “nil” or blind pooling layer, if you will.

I agree it seems counterintuitive that this is an effective strategy, but I have not done any further investigation. Prof Ng does give us references to the original Residual Net papers. It might be worth a look at those to see if they comment on this aspect of the architecture.

I guess the other possibility is that this is just a bug in how the course notebooks have specified the network. That might also become apparent if you look at the papers, or perhaps the authors have an open source implementation we could look at.

If you have time and interest to do any of that investigation, please share anything further that you learn. I’m interested in the question, but can’t promise I will have time to investigate further in the next few days anyway. Thanks for noticing this and starting the discussion!

Thanks for the detailed response.

I went to look at the original paper – the caption of Table 1 mentions downsampling with a stride of 2 but doesn’t provide details as to which layer in the block performs this.

Then I looked at their implementation (a prototxt for Caffe, I presume) and it does show kernel_size=1, stride=2 as the first step in the downsampling block.

This is also confirmed by the conv_block in the fchollet Keras implementation, which very closely follows the code in this course’s assignment.

So the conclusion is that this is how it’s always been done. There is inefficiency for 3 layers (conv_block with stride > 1), but that seems to be OK (only 3 out of a total of 50 layers).

Thank you very much for following up on this and resolving the questions!

As you say, if it’s only 3 out of the 50 layers where the downsampling happens, that loss of information apparently doesn’t spoil the results. If we were in an experimental mood, it might be interesting to try resetting the stride to 1 in those three cases and then following those layers with an average or max pooling layer and see if that makes a perceptible difference in the performance of the resulting models. That approach would increase the computational expense a bit, but lose less information.
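For anyone who wants to see the difference on a toy case first, here is a rough sketch in plain Python. It assumes a single channel, so the 1x1 convolution reduces to multiplying by one scalar weight. The helpers `conv1x1` and `avg_pool2x2` are made up for illustration and are not the notebook’s or Keras’s functions.

```python
def conv1x1(x, weight, stride):
    """Single-channel 1x1 convolution: scale every stride-th value."""
    return [[weight * v for v in row[::stride]] for row in x[::stride]]


def avg_pool2x2(x):
    """2x2 average pooling with stride 2 and 'valid' padding."""
    return [
        [(x[i][j] + x[i][j + 1] + x[i + 1][j] + x[i + 1][j + 1]) / 4
         for j in range(0, len(x[0]) - 1, 2)]
        for i in range(0, len(x) - 1, 2)
    ]


x = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]

# Current design: stride 2 keeps only the top-left value of each 2x2 cell.
strided = conv1x1(x, weight=1.0, stride=2)
print(strided)  # [[1.0, 3.0], [9.0, 11.0]]

# Proposed variant: stride 1, then average pooling over each 2x2 cell.
pooled = avg_pool2x2(conv1x1(x, weight=1.0, stride=1))
print(pooled)   # [[3.5, 5.5], [11.5, 13.5]]
```

Both versions produce a 2 x 2 output, but the second one lets all 16 inputs contribute. The cost is that the 1x1 convolution is evaluated at 16 positions instead of 4 before pooling. Whether that changes model accuracy is exactly the open experiment described above; this sketch says nothing about it.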

Now that you’ve found the source of another Residual Net implementation, there is another really interesting technical question that came up in the last couple of weeks about how our implementation in the notebook works: how it handles the “training” argument for the BatchNorm layers. Here’s a thread about that issue to see if it catches your interest! :nerd_face:

Thanks again!