Under the function convolutional_block(), in the first component of the main path, we are doing 1x1 convolution with strides=2.
This really confuses (and annoys) me.
Why are we looking at only 1/2 * 1/2 = 1/4 of the activations of the input layer?
Why are we discarding 3/4 of the activations (neurons) from the previous layer, without examining them?
If they are really unnecessary, then we could just make the input layer smaller.
Otherwise, we could have used max pooling to reduce the size instead of just discarding them.
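To make concrete what I am referring to, here is a minimal Keras sketch of that first main-path component. This is not the exact assignment code; the function name, filter count, and input shape are placeholders I chose for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def first_main_path_component(X, F1=64):
    # 1x1 convolution with strides=(2,2): each output value sees only a
    # single input position, and the stride of 2 skips every other row
    # and column, so 3 out of 4 spatial positions are never read at all.
    X = layers.Conv2D(filters=F1, kernel_size=(1, 1), strides=(2, 2),
                      padding='valid')(X)
    X = layers.BatchNormalization(axis=3)(X)
    X = layers.Activation('relu')(X)
    return X

# Hypothetical example: a 56x56x256 input comes out as 28x28x64, and the
# activations at the skipped spatial positions are simply dropped.
inputs = tf.keras.Input(shape=(56, 56, 256))
outputs = first_main_path_component(inputs)
print(outputs.shape)  # (None, 28, 28, 64)
```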
Were you able to find an answer to this question?
Not yet. I am waiting for an answer here.
You can find some answers in this discussion: