In Course 4, Week 2, "Motivation for Inception Network", the lecture mentions that when you apply max-pooling to a 28x28x192 input, you end up with a 28x28x32 output. How is this possible? I thought max-pooling is applied to each input channel independently, which means the output should have the same number of channels as the input (192). Am I misunderstanding the idea of max-pooling? Or is it just a typo? Thank you in advance for clarifying this!
Great catch. You are correct in that max-pooling should not change the number of channels.
I think there was an omission in part of the video; see this post on Stack Overflow. Basically, a 1x1 conv is still applied after max-pooling to reduce the number of channels.
I’ve checked over the original paper to verify that the answer on stackoverflow is indeed correct.
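To make the shapes concrete, here is a minimal NumPy sketch of that branch. The helper names (`max_pool_same`, `conv_1x1`), the 3x3 pool size, and the random weights are my own assumptions for illustration, not code from the course or the paper. It only shows that pooling keeps all 192 channels and that the 1x1 conv is what brings them down to 32.

```python
import numpy as np

def max_pool_same(x, f=3):
    """Max-pool with stride 1 and 'same' padding; x has shape (H, W, C)."""
    p = f // 2
    # Pad the spatial dims with -inf so padded cells never win the max.
    padded = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    h, w, _ = x.shape
    # Collect the f*f shifted views and take the max over them, per channel.
    windows = [padded[i:i + h, j:j + w, :] for i in range(f) for j in range(f)]
    return np.max(windows, axis=0)

def conv_1x1(x, weights):
    """1x1 convolution: a per-pixel linear map over the channel axis."""
    return x @ weights  # (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 192))   # input volume from the lecture
pooled = max_pool_same(x)                # channels unchanged
w = rng.standard_normal((192, 32))       # hypothetical 1x1 conv weights
out = conv_1x1(pooled, w)                # channels reduced to 32

print(pooled.shape)
print(out.shape)
```

The pooling step never mixes channels, so any change in depth has to come from the conv that follows it.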
I believe this was clarified at 2:10 in the video. I also have this in my notes… We need to use padding to match dimensions.
Either of these seems to work using the formula ((n + 2p - f)/s) + 1:
- n=28, p=0, f=1, s=1, nc= 32 → 28 x 28 x 32
- n=28, p=1, f=3, s=1, nc=32 → 28 x 28 x 32
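As a quick check of the two settings above, here is a small sketch of the output-size formula. The function name `output_size` is just an illustrative choice, and I am assuming the usual floor on the division. Note that the formula only gives the spatial size (height and width); it does not involve the number of channels.

```python
def output_size(n, p, f, s):
    """Spatial output size for input size n, padding p, filter size f, stride s."""
    return (n + 2 * p - f) // s + 1

# The two parameter settings listed above.
size_1x1 = output_size(n=28, p=0, f=1, s=1)
size_3x3 = output_size(n=28, p=1, f=3, s=1)

print(size_1x1, size_3x3)
```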
Keeping in mind that the filter itself is actually f x f x 192 max-pool
So even if the max-pool filter is 1x1 we’re finding max across the 192 dimension. This filter effectively reduces to 28 x 28 x 1
Hello @ngkhatu,
Padding is for matching the spatial dimensions, whereas the OP was asking about the channel dimension. Also, max-pooling can only reduce the spatial dimensions; it does not change the channel dimension.
I think @hackyon’s findings can explain the change in the channel dimension.
Cheers,
Raymond