Why use a 1x1 Conv2d with stride 2 in a ResNet block?

It’s an excellent point that has been brought up before, but none of the previous discussions found any explanation or justification for doing this. If the goal is to reduce the spatial size of the output at a given layer, a pooling layer would also achieve that with less loss of information, although you’d then need to follow it with a 1 x 1 Conv layer with a stride of 1 to really get the same effect. Of course, that would be somewhat more computationally expensive, since the pooling is extra work on top of the convolution. But exactly as you say, it seems strange to literally ignore so many of the inputs at various layers: with a stride of 2 in both dimensions, a 1 x 1 filter reads only every other row and every other column, which is one position in four.
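To make the comparison concrete, here is a minimal sketch in plain Python (no framework, so it is not the actual Conv2d implementation). The helper names `conv1x1` and `avg_pool2x2`, the toy 2-channel 4 x 4 input, and the weights are all made up for illustration. It changes one pixel that the strided filter never visits and checks which of the two downsampling variants notices.

```python
# Pure-Python sketch; inputs are nested lists indexed [channel][row][col].
# All names and values here are illustrative assumptions, not library code.

def conv1x1(x, weights, stride=1):
    """1x1 convolution: weights[o][c] mixes input channels at each sampled pixel."""
    rows = range(0, len(x[0]), stride)
    cols = range(0, len(x[0][0]), stride)
    return [
        [[sum(w_c * x[c][i][j] for c, w_c in enumerate(w_out)) for j in cols]
         for i in rows]
        for w_out in weights
    ]

def avg_pool2x2(x):
    """Non-overlapping 2x2 average pooling applied to each channel."""
    return [
        [[(ch[i][j] + ch[i][j + 1] + ch[i + 1][j] + ch[i + 1][j + 1]) / 4
          for j in range(0, len(ch[0]) - 1, 2)]
         for i in range(0, len(ch) - 1, 2)]
        for ch in x
    ]

# Toy input: 2 channels, 4x4 spatial grid.
x = [[[float(16 * c + 4 * i + j) for j in range(4)] for i in range(4)]
     for c in range(2)]
# Toy weights: 3 output channels, 2 input channels.
weights = [[1.0, 0.5], [-1.0, 2.0], [0.25, 0.0]]

strided = conv1x1(x, weights, stride=2)               # 1x1 conv, stride 2
pooled = conv1x1(avg_pool2x2(x), weights, stride=1)   # pool, then 1x1 conv

# Perturb a pixel the strided conv skips (channel 0, row 1, col 1).
x_perturbed = [[row[:] for row in ch] for ch in x]
x_perturbed[0][1][1] += 100.0

strided_changed = conv1x1(x_perturbed, weights, stride=2) != strided
pooled_changed = conv1x1(avg_pool2x2(x_perturbed), weights, stride=1) != pooled

positions_read = len(strided[0]) * len(strided[0][0])
positions_total = len(x[0]) * len(x[0][0])
print(positions_read, "of", positions_total, "positions read by the strided conv")
print("strided output changed:", strided_changed)
print("pooled output changed:", pooled_changed)
```

Both variants produce the same output shape, and the 1 x 1 convolution itself does the same number of multiply-adds in each case. The pooled variant only adds the cost of the pooling pass over the full-resolution input, which is where the extra expense comes from.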

I have not taken the trouble to go read any of the papers on Residual Nets. The hope would be that they might comment on this aspect, but there is no guarantee. If anyone has the time and energy to pursue that, please let us know what you find!