Thank you very much for following up on this and resolving the questions!
As you say, if the downsampling happens in only 3 of the 50 layers, that loss of information apparently doesn’t spoil the results. If we were in an experimental mood, it might be interesting to try setting the stride to 1 in those three layers, following each one with an average or max pooling layer, and seeing whether that makes a perceptible difference in the performance of the resulting models. That approach would increase the computational expense a bit, but it would lose less information.
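To make the "lose less information" point concrete, here is a minimal NumPy sketch rather than the notebook's actual Keras code. It assumes the downsampling layer is a 1x1 convolution with stride 2, and the helper names `conv1x1` and `avg_pool2x2` are made up for illustration. It compares the strided convolution against a stride-1 convolution followed by 2x2 average pooling, then perturbs one input position that the stride skips over.

```python
import numpy as np

def conv1x1(x, w, stride=1):
    """1x1 convolution on an (H, W, C_in) map with weights of shape (C_in, C_out)."""
    # A 1x1 conv with stride s only ever looks at every s-th row and column.
    return x[::stride, ::stride, :] @ w

def avg_pool2x2(x):
    """Non-overlapping 2x2 average pooling on an (H, W, C) map with even H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 4))   # toy feature map
w = rng.normal(size=(4, 6))      # toy 1x1 filter bank

strided = conv1x1(x, w, stride=2)              # downsample inside the conv
pooled = avg_pool2x2(conv1x1(x, w, stride=1))  # conv everywhere, then pool

# Perturb a position that the stride-2 sampling grid never visits.
x2 = x.copy()
x2[1, 1, :] += 1.0
strided_changed = not np.allclose(conv1x1(x2, w, stride=2), strided)
pooled_changed = not np.allclose(avg_pool2x2(conv1x1(x2, w, stride=1)), pooled)

print(strided.shape, pooled.shape)
print("strided output changed:", strided_changed)
print("pooled output changed:", pooled_changed)
```

Both variants produce the same output shape, but only the pooled one responds to the perturbed position. The price is that the stride-1 convolution is evaluated at four times as many positions in that layer.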
Now that you’ve found the source for another implementation of Residual Nets, there’s another really interesting technical question that came up in the last couple of weeks about how our implementation here in the notebook works: how it handles the “training” argument for the BatchNorm layers. Here’s a thread about that issue, in case it catches your interest!
Thanks again!
Regards,
Paul