DLS C4W3 4th video clarification needed please

Hi experts,
I need some clarifications on the 4th video, titled "Convolutional Implementation of Sliding Windows":

  1. At 9:10 in the video, the leftmost input image is 28x28x3, and Andrew mentions it’s
    using strides=2. But after the first 5x5 CONV, the image is 24x24x16. Shouldn’t it be:
    floor( (28 -5)/(strides=2) + 1 ) = floor(23/2 + 1 ) = 12 ?
    The above video corresponds to course notes slide 13.

  2. Also, around 10:02 in the video (course notes slide 14), how does the input dimension
    go from 28x28 to 16x16 with a 5x5 CONV? Have I missed something fundamental here?



Hello @Mun_Chung_Wong,

Thank you for the clear references.

What follows is based on the 16 x 16 x 3 input, but the same logic applies to your question about the 28 x 28 x 3 input.

To begin with, keep in mind that the slide discusses two methods of predicting on an image:

  • method A: Manually slice 14 x 14 windows out of the image, one at a time, at a stride of 2. We end up with four 14 x 14 crops and 4 predictions.

  • method B: Do not slice; just feed the whole 16 x 16 input into the model, which also gives 4 predictions.

    Note that only method A has that so-called stride 2. That is the stride for manual slicing.
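
To make the count in method A concrete, here is a minimal sketch that enumerates where the manual 14 x 14 windows start along one axis. The helper names `window_origins` and `count_crops` are made up for illustration; they are not from the course.

```python
def window_origins(n, window, stride):
    """Offsets along one axis of every window that fits inside n pixels."""
    return list(range(0, n - window + 1, stride))


def count_crops(n, window=14, stride=2):
    """Number of square crops when the same stride is used on both axes."""
    per_axis = len(window_origins(n, window, stride))
    return per_axis * per_axis


print(window_origins(16, 14, 2))  # offsets of the manual slices on a 16-pixel axis
print(count_crops(16))            # crops for the 16 x 16 input
print(count_crops(28))            # crops for the 28 x 28 input
```

For the 16 x 16 input this gives two offsets per axis, so 2 x 2 = 4 crops, which matches the 4 predictions mentioned above.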


Now, if you look at the above convolution, it is the result of applying method B. A 5x5 filter, applied at the convolution’s own stride of 1, converts 16 x 16 to 12 x 12 (16 - 5 + 1 = 12). That’s it! It’s method B, not method A, so there is no stride of 2.

Similarly, if the input is 28 x 28, then the filter converts it to 24 x 24 (28 - 5 + 1 = 24). Method B, that’s it, no stride of 2.
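
As a quick check of the arithmetic, here is a small sketch of the standard output-size formula. The function name `conv_out` is just an illustrative choice, and it assumes no padding unless you pass one.

```python
def conv_out(n, f, stride=1, pad=0):
    """Output size along one axis: floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1


print(conv_out(16, 5))            # 5x5 conv on a 16-pixel axis, stride 1
print(conv_out(28, 5))            # 5x5 conv on a 28-pixel axis, stride 1
print(conv_out(28, 5, stride=2))  # what a stride-2 conv would have produced instead
```

With stride 1 the results are 12 and 24, as on the slides. The last line gives 12, which is the number in the original question and shows that the conv layer itself cannot be using stride 2.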

As for your second question, there is a known error in the video. Please check out the reading item right before that video.


Thanks Raymond for the clear explanations.

And for my 2nd question sorry I just realized that additional correction link after posting my


You are welcome, @Mun_Chung_Wong!



@rmwkwok I didn’t understand why the professor said that the stride of 2 came from the 2x2 MAX POOL. Could you explain that?

Can I assume the stride-2 calculation is as below?

The input is 28x28 and the output is 8x8 (skipping the channels for the time being).
Also, the filter size is 14x14, as we are striding over a 28x28 image using our original 14x14-input ConvNet.
So n = 28, f = 14, and the output is 8; s is unknown.

The formula (without padding) for calculating the output size is
$$\left\lfloor \frac{n-f}{s} + 1 \right\rfloor = \text{output}$$

Filling in the known values to compute s,

$$\left\lfloor \frac{28-14}{s} + 1 \right\rfloor = 8$$
Rearranging terms (the integer 1 can be moved out of the floor),
$$\left\lfloor \frac{28-14}{s} \right\rfloor = 8 - 1$$
which simplifies to
$$\left\lfloor \frac{14}{s} \right\rfloor = 7$$
The only integer s that satisfies this is
$$s = \left\lfloor \frac{14}{7} \right\rfloor = 2$$
which gives s = 2.

Trying this on the 16x16x3 input also gives a stride of s = 2.
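
The same calculation can be checked by brute force. This is only a sketch: `strides_matching` is a hypothetical helper that tries every integer stride and keeps those for which the no-padding formula reproduces the observed output size.

```python
def strides_matching(n, f, out):
    """All integer strides s with floor((n - f) / s) + 1 == out."""
    return [s for s in range(1, n + 1) if (n - f) // s + 1 == out]


print(strides_matching(28, 14, 8))  # 28x28 input, 14x14 window, 8x8 output
print(strides_matching(16, 14, 2))  # 16x16 input, 14x14 window, 2x2 output
```

Both calls return a single candidate, s = 2, so the answer is unique for these two inputs.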

@Mun_Chung_Wong Thanks for asking this question. I had the same one in mind.

Hello, @Jeffrey_Antony,

To understand that remark of Andrew’s, the best way is to repeat the conversions below and change the max pooling’s stride to other values.


| max-pooling stride | output | effective stride      |
|--------------------|--------|-----------------------|
| 2                  | 8x8    | 2 (as you calculated) |
| 3                  | 4x4    | 4                     |
| 4                  | 2x2    | anything from 8 to 14 |
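
If you want to repeat these conversions yourself, here is a rough sketch. It assumes the layer stack as I recall it from the lecture (5x5 conv at stride 1, 2x2 max pool, then the first FC layer implemented as a 5x5 conv, followed by 1x1 convs that do not change the spatial size); the function names are made up for illustration.

```python
def out_size(n, f, s=1):
    """No-padding output size along one axis."""
    return (n - f) // s + 1


def network_out(n, pool_stride):
    """Spatial output size of the assumed ConvNet for an n x n input."""
    n = out_size(n, 5)               # 5x5 conv, stride 1
    n = out_size(n, 2, pool_stride)  # 2x2 max pool with the chosen stride
    n = out_size(n, 5)               # FC layer written as a 5x5 conv (assumption)
    return n                         # the 1x1 convs keep the size unchanged


def effective_strides(n, out, window=14):
    """Integer strides at which manual 14x14 slicing yields `out` windows per axis."""
    return [s for s in range(1, n + 1) if out_size(n, window, s) == out]


for pool_stride in (2, 3, 4):
    out = network_out(28, pool_stride)
    print(pool_stride, f"{out}x{out}", effective_strides(28, out))
```

For the 28x28 input this prints one row per pooling stride, with the list of matching effective strides in the last column.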

The max-pooling’s stride is not always equal to the effective stride, so the lecture’s example is a beautiful coincidence.

Andrew’s remark is correct in the sense that the effective stride depends on the max pooling’s stride (as the table shows; otherwise the last column would not change with the first), but it was not establishing a quantitative relation.



Thank you for the explanation.
