Hi experts,
I need some clarifications on the 4th video, titled "Convolutional Implementation of Sliding Windows":

At 9:10 in the video, the leftmost input image is 28x28x3, and Andrew mentions it’s
using a stride of 2. But after the first 5x5 CONV, the image is 24x24x16. Shouldn’t it be:
floor( (28 -5)/(strides=2) + 1 ) = floor(23/2 + 1 ) = 12 ?
The above video corresponds to course notes slide 13.

Also, around 10:02 in the video (course notes slide 14), how does the input dimension
go from 28x28 to 16x16 with a 5x5 CONV? I must have missed something fundamental here.

My answer below is based on the 16 x 16 x 3 input, but the same logic applies to your question about the 28 x 28 x 3 input.

To begin with, keep in mind that this one slide discusses two methods for making predictions on an image:

method A: Manually slice 14 x 14 windows out of the image, one at a time, moving with a stride of 2. We end up with four 14 x 14 images and 4 predictions.

method B: Do not slice. Just feed the whole 16 x 16 input into the model, which also gives 4 predictions.

Note that only method A has that so-called stride 2. That is the stride for manual slicing.

Now, if you look at the above convolution, it is the result of applying method B. A filter of 5x5 converts 16 x 16 to 12 x 12. That’s it! It’s method B, so it is not method A, so there is no stride of 2.

Similarly, if the input is 28 x 28, then the filter converts it to 24 x 24. Method B, that’s it, no stride of 2.
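As a rough check of the two methods, here is a small Python sketch. It assumes the lecture's layer stack as I recall it (5x5 CONV with stride 1, 2x2 MAX POOL with stride 2, then the first FC layer rewritten as a 5x5 CONV). The function names are made up for illustration and the image is a dummy array of zeros.

```python
import numpy as np

def out_size(n, f, s=1):
    """Output width of an unpadded conv or pooling layer: floor((n - f) / s) + 1."""
    return (n - f) // s + 1

def method_a_windows(n, window=14, stride=2):
    """Method A: manually crop window x window patches from a dummy n x n x 3 image."""
    image = np.zeros((n, n, 3))
    crops = [image[r:r + window, c:c + window]
             for r in range(0, n - window + 1, stride)
             for c in range(0, n - window + 1, stride)]
    return len(crops)

def method_b_shapes(n):
    """Method B: feed the whole n x n image through the layers (assumed stack)."""
    sizes = [n]
    sizes.append(out_size(sizes[-1], 5))      # 5x5 CONV, stride 1
    sizes.append(out_size(sizes[-1], 2, 2))   # 2x2 MAX POOL, stride 2
    sizes.append(out_size(sizes[-1], 5))      # FC layer rewritten as a 5x5 CONV
    return sizes

for n in (14, 16, 28):
    print(n, method_a_windows(n), method_b_shapes(n))
# 14 1 [14, 10, 5, 1]
# 16 4 [16, 12, 6, 2]
# 28 64 [28, 24, 12, 8]
```

The 5x5 CONV itself always runs with stride 1 (16 to 12, 28 to 24). The number of crops in method A matches the number of positions in the final output grid of method B (2x2 = 4 and 8x8 = 64).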

There is a known error. Please check out the reading item right before that video.

@rmwkwok I didn't understand why the Prof. said that the stride of 2 came from the MAX POOL 2x2. Could you explain that?

Can I assume the stride-2 calculation is as below?

The input is 28x28 and the output is 8x8 (skipping the channels for the time being).
The filter size is 14x14, since we are striding over a 28x28 image with our original ConvNet, which takes a 14x14 input.
So n = 28, f = 14, and the output is 8. The stride s is unknown.

The formula (without padding) for calculating the output size is $\lfloor \frac{n-f}{s} + 1 \rfloor = \text{output}$

Filling in the known values to compute s,

$\lfloor \frac{28-14}{s} + 1 \rfloor = 8$
rearranging terms: $\lfloor \frac{28-14}{s} \rfloor = 8 - 1$
which simplifies to: $\lfloor \frac{14}{s} \rfloor = 7$
rearranging to solve for s: $s = \lfloor \frac{14}{7} \rfloor$
which gives $s = 2$.

Trying this on the 16x16x3 input image also gives a stride of s = 2.
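Since a floor cannot be inverted exactly by rearranging, one way to double-check the result is to try every integer stride and keep those that satisfy the formula. This is only a sketch; `effective_strides` is a made-up helper name, and it searches strides from 1 up to n - f.

```python
def effective_strides(n, f, out):
    """Return every integer stride s with floor((n - f) / s) + 1 == out."""
    return [s for s in range(1, n - f + 1) if (n - f) // s + 1 == out]

print(effective_strides(28, 14, 8))  # [2]
print(effective_strides(16, 14, 2))  # [2]
```

For both inputs, 2 is the only integer stride consistent with the output size.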

@Mun_Chung_Wong Thanks for asking this question. I had the same one in mind.

To understand that remark of Andrew’s, the best way is to repeat the conversions below and change the max pooling’s stride to other values.

| max-pooling stride | output | effective stride |
|---|---|---|
| 2 | 8x8 | 2 (as you calculated) |
| 3 | 4x4 | 4 |
| 4 | 2x2 | anything from 8 to 14 |

The max-pooling’s stride is not always equal to the effective stride, so the lecture’s example is a beautiful coincidence.

Andrew’s remark is correct in the sense that the effective stride depends on the max pooling’s stride (as shown in the table; otherwise the last column wouldn’t change with the first column), but it was not establishing a quantitative relation.
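The table can be reproduced with a short Python sketch. It assumes a 28x28 input and the lecture's layers as I recall them (5x5 CONV with stride 1, 2x2 MAX POOL with a variable stride, then the FC layer rewritten as a 5x5 CONV); the helper names are made up for illustration.

```python
def out_size(n, f, s=1):
    """Output width of an unpadded conv or pooling layer: floor((n - f) / s) + 1."""
    return (n - f) // s + 1

def final_size(n, pool_stride):
    """Width of the final output grid for an n x n input (assumed layer stack)."""
    n = out_size(n, 5)                # 5x5 CONV, stride 1
    n = out_size(n, 2, pool_stride)   # 2x2 MAX POOL, stride varied here
    return out_size(n, 5)             # FC layer rewritten as a 5x5 CONV

def effective_strides(n, f, out):
    """Every integer stride s with floor((n - f) / s) + 1 == out."""
    return [s for s in range(1, n - f + 1) if (n - f) // s + 1 == out]

for pool_stride in (2, 3, 4):
    out = final_size(28, pool_stride)
    print(pool_stride, f"{out}x{out}", effective_strides(28, 14, out))
# 2 8x8 [2]
# 3 4x4 [4]
# 4 2x2 [8, 9, 10, 11, 12, 13, 14]
```

Only the pooling stride changes between the three rows, yet the effective stride moves from 2 to 4 to a whole range of values.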