After the max pooling, the output is 5x5x16. Is the stride understood to be 2, or did I miss something?

Shouldn't the output size be \lfloor \displaystyle \frac {n - f + 2p} {s} \rfloor + 1 ?

If f = 5, p = 0 and s = 1, then we have:

\lfloor \displaystyle \frac {14 - 5 + 2*0} {1} \rfloor + 1 = 9 + 1 = 10

right?

Hi Paul, I'm sorry for editing the question. I realized I had unconsciously added a stride of 2 while calculating the first output. Could you help me understand the output of the max pooling layer too?

Yes, there is no rule that the stride has to be the same at every layer. So it's 1 for the first conv layer, but at the max pooling layer it's f = 2 and s = 2, which is one of the standard pooling choices.

\lfloor \displaystyle \frac {10 - 2 + 2*0}{2} \rfloor + 1 = 4 + 1 = 5
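If it helps to check the arithmetic, here is a minimal sketch of the same formula in Python. The function name `output_size` is just a label chosen for this illustration, not something from the course code; the values for n, f, p and s are the ones used in this thread.

```python
def output_size(n, f, p=0, s=1):
    """Spatial output size of a conv or pooling layer: floor((n - f + 2p) / s) + 1."""
    return (n - f + 2 * p) // s + 1


# Conv layer from the thread: n = 14, f = 5, p = 0, s = 1
conv_out = output_size(14, f=5, p=0, s=1)

# Max pooling layer applied to that result: f = 2, p = 0, s = 2
pool_out = output_size(conv_out, f=2, p=0, s=2)

print(conv_out, pool_out)  # 10 5
```

The same function covers both layers because the formula only depends on n, f, p and s, not on whether the layer is a convolution or a pooling operation.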

The one key difference to note with pooling layers is that the operation is always done “channelwise”, meaning “per channel”, so the number of output channels stays the same. Of course that is *not* the way conv layers work.
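To make the "per channel" point concrete, here is a rough NumPy sketch of max pooling with f = 2 and s = 2. It is an illustration only, not the course's implementation: the function name `max_pool_2d`, the (height, width, channels) layout, and the random 10x10x16 input are all assumptions made for this example.

```python
import numpy as np


def max_pool_2d(x, f=2, s=2):
    """Max pooling on an (H, W, C) array, applied to each channel independently."""
    h, w, c = x.shape
    out_h = (h - f) // s + 1
    out_w = (w - f) // s + 1
    out = np.empty((out_h, out_w, c), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # f x f window across all channels; shape (f, f, C)
            window = x[i * s:i * s + f, j * s:j * s + f, :]
            # Take the max over the two spatial axes only, so channels never mix
            out[i, j, :] = window.max(axis=(0, 1))
    return out


x = np.random.rand(10, 10, 16)  # assumed 10x10 conv output with 16 channels
y = max_pool_2d(x, f=2, s=2)
print(y.shape)  # (5, 5, 16)
```

Because the max is taken only over the spatial axes, the 16 channels pass through unchanged, which is why the output is 5x5x16.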

Thanks a lot for the quick and precise answers, Paul!