C4W3 video 5; Question about stride in Convolutional implementation of sliding window

What if the stride of the sliding window is different from that of the first convolution? And what if it is different from that of the pooling layer? Or should they always be the same? (Does the first convolution have stride 1 and the max pool have the same stride as the sliding window?)

In a CNN, the filter strides of the first convolutional layer, subsequent convolutional layers, and pooling layers can all be different. They don’t have to be the same.
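To illustrate, here is a minimal sketch showing that each layer's output size depends only on its own filter size, padding, and stride. It assumes square inputs and the standard formula floor((n + 2p - f) / s) + 1. The helper names `out_size` and `trace_sizes` and the layer stack are hypothetical, chosen only for this example:

```python
def out_size(n, f, s=1, p=0):
    """Spatial output size of a conv or pool layer on an n x n input."""
    return (n + 2 * p - f) // s + 1


# Hypothetical stack: (name, filter size, stride). Each layer has its own stride.
layers = [
    ("conv 5x5", 5, 1),
    ("max pool 2x2", 2, 2),
    ("conv 3x3", 3, 3),
]


def trace_sizes(n, layers):
    """Return the spatial size after each layer, starting with the input size."""
    sizes = [n]
    for _name, f, s in layers:
        n = out_size(n, f, s)
        sizes.append(n)
    return sizes


print(trace_sizes(28, layers))
```

Nothing in the formula ties one layer's stride to another's, so mixing strides of 1, 2, and 3 in the same network is valid.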


The video is about object detection using a sliding window and focuses on using convolution instead of the sliding window for detection. I apologize for the confusion regarding the detection sliding window.

If the stride of the max pool layer is 2 and the stride of the sliding window is also 2, then the case presented in the video is fine. However, if the max pool’s stride is different, what will happen? Will it still work? The same question applies to the stride of the first convolution layer. Will the algorithm work if the stride is not 1 for the first conv layer?

Hey @Mohamed_Akram,

As @Juan_Olano said, the strides of the first conv layer and the other layers don’t have to be the same.

But let’s address your question:

  1. Stride of Max Pool Layer Different from Sliding Window:
  • If the stride of the max-pooling layer is different from the stride of the sliding window, it can still work, but it will affect the spatial resolution of the feature maps.

  • A larger stride in the max-pooling layer (e.g., 2) compared to the sliding window (e.g., 1) will reduce the spatial dimensions of the feature maps. This means you’ll have less detailed feature maps to work with in subsequent layers, which may impact the algorithm’s ability to precisely localize objects and detect smaller objects.

  2. Stride of First Convolution Layer Not 1:
  • If the stride of the first convolutional layer is not 1, it will also impact the algorithm’s ability to capture fine-grained details and localize objects.

  • As with max pooling, a larger stride in the first convolutional layer means that the initial feature maps will have reduced spatial resolution, which can make it more challenging to detect objects accurately, especially small or closely spaced objects.
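One way to see how the layer strides relate to the detection window is to compute the effective window size and stride that the network implies on the input image. In a convolutional implementation, each output cell corresponds to one window position, and the step between neighbouring windows is the product of the layer strides. The sketch below is minimal and makes several assumptions: no padding, square inputs, and a layer list (5x5 conv with stride 1, 2x2 max pool with stride 2, a 5x5 conv standing in for the first fully connected layer, then 1x1 convs) that is my reading of the example network in the video. All function names are hypothetical.

```python
from math import prod


def out_size(n, f, s):
    """Spatial output size of a layer with filter f and stride s (no padding)."""
    return (n - f) // s + 1


def effective_stride(layers):
    """Step, in input pixels, between neighbouring detection windows."""
    return prod(s for _f, s in layers)


def window_size(layers):
    """Input region (receptive field) seen by a single output cell."""
    r, jump = 1, 1
    for f, s in layers:
        r += (f - 1) * jump  # the filter extends the region by (f - 1) steps
        jump *= s            # later layers step over the input 'jump' pixels at a time
    return r


def output_grid(n, layers):
    """Number of window positions per side for an n x n input."""
    for f, s in layers:
        n = out_size(n, f, s)
    return n


# Assumed example network as (filter size, stride) pairs.
video_net = [(5, 1), (2, 2), (5, 1), (1, 1), (1, 1)]

print(window_size(video_net), effective_stride(video_net))
print(output_grid(16, video_net), output_grid(28, video_net))
```

Under these assumptions the network acts as a 14x14 window moved with stride 2, giving a 2x2 grid of predictions on a 16x16 input and an 8x8 grid on a 28x28 input. Changing the first conv stride to 2, as in `[(5, 2), (2, 2), (5, 1)]`, still runs, but the implied window stride becomes 4 and the implied window size changes too, so the detection stride cannot be chosen independently of the layer strides.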

In practice, many object detection architectures typically start with a small stride (e.g., 1) in the first convolutional layer to capture fine-grained features and details.

I hope it makes sense now.
Regards,
Jamal
