Convolutional Implementation of Sliding Windows

Can anybody clarify why we are using a stride of two in the sliding windows for the test images?


Using a larger stride reduces the number of window positions, so the forward pass takes fewer computations.
The downside is that the output is smaller, so it may not contain as much information.

These are tradeoffs you can use in designing your CNN.


Thanks. For the original training image, Andrew didn’t draw boxes, as you can see in the above image, but for the test image he makes slices of it and slides the window with a stride of 2. That’s the specific part I’m asking about. I understand the concept you shared in the comments, but here I’m confused. Could you clarify it a bit?

Hello @Muhammad-Kalim-Ullah,

First, you have trained this model which accepts only 14 x 14 x 3 images:

Then, you have a test image which is 16 x 16 x 3. What can we do about it? It is larger than the size the model accepts. However, we don’t want to just crop the image to 14 x 14 x 3, because we would lose the opportunity for the model to identify something in the cropped-away region. One idea is to pass only a slice of size 14 x 14 x 3 into the model at a time, passing 4 slices in total.

We pass 4 slices in total so that every pixel has at least one chance of being seen by the model, and 4 is the minimum number of slices needed to achieve that goal. It is true that the 4 slices share a lot of common pixels, but each slice also has some pixels that the others don’t have.
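To make the counting concrete, here is a small sketch (the image values are random stand-ins for a real test image): sliding a 14 x 14 window over a 16 x 16 x 3 image with stride 2 produces exactly 4 slices, and together they cover every pixel.

```python
import numpy as np

# Hypothetical 16x16x3 test image; random values stand in for real pixels.
test_image = np.random.rand(16, 16, 3)

window, stride = 14, 2
slices = []
for top in range(0, 16 - window + 1, stride):       # top  = 0, 2
    for left in range(0, 16 - window + 1, stride):  # left = 0, 2
        slices.append(test_image[top:top + window, left:left + window, :])

print(len(slices))      # 4 slices in total
print(slices[0].shape)  # each is (14, 14, 3), the size the model accepts
```

Each slice overlaps the others heavily (they share a 12 x 12 region in each direction), which is exactly the redundancy discussed below.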

However, precisely because the slices share so many pixels, a lot of computation is “wasted” on repeatedly processing those shared pixels when the slices pass through the model one at a time. Therefore, Andrew suggested the convolutional “sliding window” approach, which is computationally cheaper and produces predictions for all 4 slices in a single forward pass.
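The idea above can be verified with a toy example. This sketch uses a single channel and one random 3 x 3 filter standing in for a trained conv layer (both are assumptions for illustration, not Andrew’s actual network): convolving the whole 16 x 16 image once gives a feature map whose sub-windows are exactly the feature maps of the 4 slices, so the shared pixels are never re-processed.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((16, 16))   # single channel, to keep the sketch small
kern = rng.random((3, 3))      # one 3x3 filter standing in for a trained layer

def conv2d(x, k):
    """Naive 'valid' cross-correlation (what CNN libraries call convolution)."""
    h = x.shape[0] - k.shape[0] + 1
    w = x.shape[1] - k.shape[1] + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * k)
    return out

# One pass over the whole 16x16 image: a 14x14 feature map, computed once.
full_map = conv2d(image, kern)

# The 12x12 feature map of any 14x14 slice is just a sub-window of full_map,
# so the overlapping pixels are only ever processed once.
for r in (0, 2):
    for c in (0, 2):
        slice_map = conv2d(image[r:r + 14, c:c + 14], kern)
        assert np.allclose(slice_map, full_map[r:r + 12, c:c + 12])
print("shared computation verified for all 4 slices")
```

In the full convolutional implementation, the same sharing happens at every layer, so the final output is a small grid (here it would be 2 x 2) containing one prediction per window, all from one forward pass.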

Is this clear?
