Window size in convolutional implementation of sliding windows

In the lecture, we saw how different window sizes can be applied in the sliding windows algorithm. However, in the convolutional implementation of sliding windows, there is no mention of window size. The window size is implicitly 14x14, i.e. the size of the images used to train the CNN. How can we handle different window sizes in the convolutional implementation of sliding windows?

Please update your post with a link to the lecture / timestamp.

@balaji.ambresh I have attached the image.

There are 3 rows of images in the slide:

  1. The top row shows how we train a CNN to predict the class of an image whose dimensions are 14x14x3.
  2. The middle row shows how we can use our convnet trained on 14x14x3 images to perform detection on 16x16x3 images.
  3. The last row shows a bigger example with 28x28x3.

Let’s consider the middle row:

  1. One way to perform detection on a 16x16x3 image is to manually isolate a 14x14x3 patch within the image and run our original convnet with that patch as input.
  2. When we use horizontal and vertical strides of 2, we get 4 outputs, each one corresponding to the output for that patch.
  3. Since a lot of computations are shared across these 4 patches, we can run the full 16x16x3 image as input through our convnet trained on 14x14x3 at inference time. This gives a 2x2x4 output from the last layer (the fully connected layer rewritten as a 1x1 convolution).
  4. As far as outputs are concerned:
    a. Left top box of the output corresponds to the left top 14x14x3 portion of the 16x16x3 input to the convnet.
    b. Right top box of the output corresponds to the right top 14x14x3 portion of the 16x16x3 input to the convnet (the patch shifted by 2 pixels horizontally and 0 pixels vertically).
    c. The bottom 2 are now self-explanatory.
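To put a rough number on the shared computation, here is a small sketch that counts only first-layer convolution output positions. It assumes a 5x5 first-layer kernel with no padding and an effective stride of 2 between windows, as in the lecture network; the function names are hypothetical and the count is only a proxy for the real cost.

```python
def conv_positions_cropped(image_size, window=14, kernel=5, stride=2):
    """First-layer output positions if every window is cropped and run separately."""
    windows_per_side = (image_size - window) // stride + 1
    per_window = (window - kernel + 1) ** 2  # 10x10 positions for one 14x14 crop
    return windows_per_side ** 2 * per_window


def conv_positions_shared(image_size, kernel=5):
    """First-layer output positions when the whole image is run through once."""
    return (image_size - kernel + 1) ** 2


for size in (16, 28):
    print(size, conv_positions_cropped(size), conv_positions_shared(size))
```

Under these assumptions the 16x16 image needs 400 positions when cropped versus 144 when shared, and the gap widens for the 28x28 image.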

This approach shows that we don’t have to manually crop images to match the input dimensions of the underlying trained CNN; instead, we can interpret the outputs differently and make better use of the hardware.

Here’s an example:

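import tensorflow as tf

# Fully convolutional version of the 14x14x3 classifier from the slide.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, None, 3)),  # height and width left unspecified
    tf.keras.layers.Conv2D(filters=16, kernel_size=5),
    tf.keras.layers.MaxPooling2D(pool_size=2),  # source of the effective stride of 2
    tf.keras.layers.Conv2D(filters=400, kernel_size=5),  # first FC layer as a 5x5 conv
    tf.keras.layers.Conv2D(filters=400, kernel_size=1),
    tf.keras.layers.Conv2D(filters=4, kernel_size=1),
])
print(model(tf.random.uniform(shape=(1, 14, 14, 3))).shape) # (1, 1, 1, 4)
print(model(tf.random.uniform(shape=(1, 16, 16, 3))).shape) # (1, 2, 2, 4)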
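The mapping from each output cell back to its input patch can also be written out in plain Python. This is a minimal sketch assuming a 14x14 window and an effective stride of 2 (from the 2x2 max pool in the lecture network); the helper names are hypothetical.

```python
def output_grid(image_size, window=14, stride=2):
    """Number of output cells along one side of a square input image."""
    return (image_size - window) // stride + 1


def window_for_cell(row, col, window=14, stride=2):
    """Input patch (top, left, bottom, right) that output cell (row, col) scores."""
    top = row * stride
    left = col * stride
    return (top, left, top + window, left + window)


for size in (14, 16, 28):
    n = output_grid(size)
    print(f"{size}x{size} input -> {n}x{n} output cells")

# Right top cell of the 2x2 output for a 16x16 input.
print(window_for_cell(0, 1))
```

For the 16x16 case, cell (0, 1) covers rows 0 to 14 and columns 2 to 16 (end-exclusive), which is the right top patch described in the list above.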

Thanks for your reply @balaji.ambresh.

I understand what is going on in the attached image. My point is that whether we are using the 16x16 image or the 28x28 image, the window size being implicitly used is 14x14. What if I want to scan the image in 8x8 sections, for example? One reason we might want a smaller window size is that the car might be far away and so appear smaller in the image. We would want a smaller window size to narrow down the position of the car.

Accuracy of the outcome depends on input_shape of the underlying CNN.

But the car can be of varying sizes (the further away, the smaller; the closer, the bigger). Shouldn’t we be able to detect cars across a range of sizes? The plain sliding windows algorithm seems to achieve this. Obviously that algo is too computationally expensive, so we used the conv sliding windows algo. While this approach is much faster for a given window size (input_shape as you said), it does not use multiple window sizes like the plain sliding windows algo, so it fails to detect cars much bigger than 14x14 or much smaller than 14x14.

Your observation is partially correct. Computation aside, this method won’t be able to detect cars that are larger than the underlying CNN input size.
Please go through the rest of the lectures and do the programming assignment on YOLO.

Thanks for your reply @balaji.ambresh.

While it can potentially detect cars smaller than 14x14, the bounding box around that small car is too large (14x14). In the plain sliding windows algorithm we can potentially narrow it down further when we use window sizes under 14x14. So besides the computational inefficiency, the traditional plain sliding window algo seems to be better than the convolutional algo as far as narrowing down the car. Obviously YOLO is much better than both but I’m just referring to the 2 types of sliding windows algos.

As for how to detect cars across a range of sizes using sliding windows: this can be done by trying many different kernel sizes. Different kernel sizes could be used by either:

  1. creating a parallel network, like an Inception module
  2. creating more than one model with different kernel sizes, which you could concatenate into one final model

Finally, this is too computationally expensive, so we try to use more efficient algorithms like Faster R-CNN or YOLO.
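For reference, another commonly used option keeps a single trained network and rescales the image instead of changing kernel sizes (an image pyramid). This is not from the lecture slide, so treat it as an assumption; the sketch below only shows the arithmetic for a 14x14 network, and the function names are hypothetical.

```python
def scale_for_target(target, window=14):
    """Resize factor that makes a target-sized object fill the network's window."""
    return window / target


def effective_window(window=14, scale=1.0):
    """Original-image pixels covered by one window after resizing by `scale`."""
    return window / scale


for target in (8, 14, 28):
    s = scale_for_target(target)
    covered = effective_window(scale=s)
    print(f"target {target}x{target}: resize by {s:.2f}, window covers {covered:.0f} px")
```

Upscaling by 1.75 makes the 14x14 window behave like an 8x8 window on the original image, while downscaling by 0.5 makes it behave like a 28x28 window. Each extra scale costs another forward pass.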

Thanks @AbdElRhaman_Fakhry for your reply. I think I have clarity on this now.