Convolutional Sliding window example only works for a stride of 2

A similar question was asked here with no satisfactory answer yet.

So I am re-posting the question. Andrew slides the 14x14 window over the 16x16 test image with a stride of 2. If I slide it with a stride of 1, I should end up with a 3x3 final shape (ignoring the number of channels), since 1 + (16-14)/1 = 3. But simply passing the 16x16 image through the trained architecture results in a 2x2 final layer, as shown in the video.
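
For reference, the count used above can be written as a small helper. This is only a sketch of the formula 1 + (image - window) / stride with no padding; the name `window_positions` is made up for illustration.

```python
def window_positions(image_size: int, window_size: int, stride: int) -> int:
    """Number of sliding-window positions along one axis (no padding)."""
    if window_size > image_size:
        return 0
    return 1 + (image_size - window_size) // stride


for stride in (1, 2):
    n = window_positions(16, 14, stride)
    print(f"stride {stride}: {n} x {n} = {n * n} windows")
# stride 1: 3 x 3 = 9 windows
# stride 2: 2 x 2 = 4 windows
```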

So how do we understand this discrepancy?

If I train a network with a 14x14 input shape, does that restrict the stride at which I can slide the window, depending on the shape of the test image I make predictions on?

Hello @Nilanjan_Banik,

Here is the way I would suggest you think about it; please bear with me for a while:

  1. First, forget about sliding at a stride of 2 over the 16 x 16 input. Set it aside for now.

  2. The first Conv2D layer of the model is set to a stride of 1. Verify this yourself by reasoning how the shape changes from 14 x 14 to 10 x 10 (in the first row), and from 16 x 16 to 12 x 12 (in the second row).

  3. Verify the rest of the shapes in the slide, so that you can convince yourself that, whatever the input shape is, all operations are the same with the same stride settings, and yet they produce the outputs shown in the slide.

  4. Realize that there are 2 ways to make predictions on an image larger than the model was designed for. (A) Manually slice 14 x 14 windows out of the image one at a time; if we do it at a stride of 2, we end up with four 14x14 images and 4 predictions. (B) Do not slice at all, so there is NO stride of 2; just feed the 16x16 input into the model, and it also gives 4 predictions.

  5. If you use method A with a stride of 2, you get 4 predictions.

  6. If you use method A with a stride of 1, you get 9 predictions.

  7. If you use method B (no need to specify any stride), you get 4 predictions.
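
The shape checks in steps 2 and 3 can be done with a few lines of arithmetic. This is a minimal sketch, assuming the layer stack from the lecture slide is a 5x5 conv, a 2x2 max pool, a 5x5 conv, and then two 1x1 convs, all without padding. The names `LAYERS`, `conv_out`, and `trace` are only for illustration.

```python
def conv_out(size: int, kernel: int, stride: int = 1) -> int:
    """Output size along one axis for a conv/pool layer with no padding."""
    return (size - kernel) // stride + 1


# (name, kernel size, stride); assumed from the lecture slide, not from code.
LAYERS = [
    ("conv 5x5", 5, 1),
    ("max pool 2x2", 2, 2),
    ("conv 5x5 (former FC)", 5, 1),
    ("conv 1x1 (former FC)", 1, 1),
    ("conv 1x1 (output)", 1, 1),
]


def trace(size: int) -> list:
    """Spatial size after each layer, starting with the input size."""
    sizes = [size]
    for _name, kernel, stride in LAYERS:
        sizes.append(conv_out(sizes[-1], kernel, stride))
    return sizes


print(trace(14))  # [14, 10, 5, 1, 1, 1]
print(trace(16))  # [16, 12, 6, 2, 2, 2]
```

The same layers with the same strides give 1x1 for a 14x14 input and 2x2 for a 16x16 input. The only stride of 2 in this stack belongs to the max pool, not to any sliding window.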

Don’t mix up methods A and B. If you use method B, you don’t (can’t) specify any sliding-window stride, and you get 4 predictions as a result of the model’s configuration (as you verified in steps 2 and 3 above).

In step 1 above, I asked you to set the stride of 2 aside because mixing up methods A and B is wrong.

As said above, if you use method B, there is no such thing as the stride of sliding window. If you use method A, however, there is no restriction on that stride.
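
To see the two methods side by side, here is a toy sketch in plain Python. It is not the network from the course: it assumes a single channel, random integer weights, no activation, and only the conv 5x5, max pool 2x2, conv 5x5 part of the stack (the 1x1 layers act per position and do not change the alignment). All function names are hypothetical.

```python
import random


def conv2d(x, k):
    """Valid 2D cross-correlation with stride 1 on a single channel."""
    m = len(k)
    out = len(x) - m + 1
    return [[sum(x[i + a][j + b] * k[a][b] for a in range(m) for b in range(m))
             for j in range(out)] for i in range(out)]


def maxpool2(x):
    """2x2 max pooling with stride 2."""
    n = len(x) // 2
    return [[max(x[2 * i][2 * j], x[2 * i][2 * j + 1],
                 x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1])
             for j in range(n)] for i in range(n)]


def model(x, k1, k2):
    """Toy fully convolutional model: conv 5x5 -> max pool 2x2 -> conv 5x5."""
    return conv2d(maxpool2(conv2d(x, k1)), k2)


def crop(x, top, left, size=14):
    """Cut a size x size window out of x, starting at (top, left)."""
    return [row[left:left + size] for row in x[top:top + size]]


def rand_grid(n):
    return [[random.randint(-3, 3) for _ in range(n)] for _ in range(n)]


random.seed(0)
k1, k2 = rand_grid(5), rand_grid(5)
image = rand_grid(16)

# Method B: one pass over the whole 16x16 image, 2x2 output.
method_b = model(image, k1, k2)

# Method A: crop 14x14 windows at a stride of 2, one 1x1 output per window.
method_a = [[model(crop(image, 2 * i, 2 * j), k1, k2)[0][0] for j in range(2)]
            for i in range(2)]

print(method_a == method_b)  # True
```

In this toy setup, method B reproduces the 4 predictions of method A at a stride of 2, without any window stride being specified anywhere. The 9 predictions of method A at a stride of 1 include offsets that method B never evaluates.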


Thanks @rmwkwok, your response is very helpful! I agree with all your points.

So, to summarize, method A and method B are two distinct ways of making predictions on a larger image. Method A is computationally more expensive because overlapping windows repeat the same computations, while method B is much more efficient because it handles all the computations in a single pass through the network.
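
A rough way to see the difference in cost is to count how many output positions the first conv layer has to compute in each case. This sketch only looks at that first 5x5 layer (per filter) and assumes no padding; the helper names are invented for illustration.

```python
WINDOW, IMAGE, KERNEL = 14, 16, 5


def first_conv_positions(input_size: int, kernel: int = KERNEL) -> int:
    """Output positions of the first conv layer for a square input."""
    side = input_size - kernel + 1
    return side * side


def method_a_positions(stride: int) -> int:
    """Method A: every cropped window runs the first conv layer separately."""
    per_axis = 1 + (IMAGE - WINDOW) // stride
    return per_axis * per_axis * first_conv_positions(WINDOW)


# Method B: the first conv layer runs once over the whole image.
method_b_positions = first_conv_positions(IMAGE)

print(method_a_positions(2), method_a_positions(1), method_b_positions)
# 400 900 144
```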

Hello @Nilanjan_Banik,

You are welcome, and your summary is perfect.


@rmwkwok What happens with different window sizes? Here we are assuming a sliding window size of 14x14. What if we want a window size of 10x10 or 20x20? The CNN is trained to handle only images of size 14x14. Doesn’t the sliding window algorithm have to work on windows of many sizes, not just 14x14? Would we have to train as many CNNs as the window sizes we are using?

Thanks in advance for the reply!