In the Week 2 Inception Network lecture, the output of the max pooling operation on a volume of 28*28*192 is shown to be 28*28*32. But in a pooling operation, the depth of the input and the output remains unchanged.
So how is it possible for the pooling operation to take an input volume of 28*28*192 and produce an output of shape 28*28*32?
So, Prof. Ng actually addresses this in the lecture when talking about max pooling (also, these are just examples to explain the motivation for the Inception network, so they might look a bit strange). The quote is as follows:
Now in order to make all the dimensions match, you actually need to use padding for max pooling. So this is an unusual form of pooling, because if you want the input to have a height and width of 28 by 28 and have the output match the dimension of everything else, also 28 by 28, then you need to use same padding as well as a stride of one for pooling.
So, with same padding and a stride of 1, the output height and width after applying the pooling stay at 28 by 28.
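As a rough check of the spatial arithmetic, here is a minimal sketch using the standard output-size formula floor((n + 2p - f) / s) + 1. The 3x3 pooling window and the helper name `pool_output_size` are assumptions for illustration; the window size is not stated in this thread.

```python
import math

def pool_output_size(n, f, stride, pad):
    """Output size along one spatial axis: floor((n + 2*pad - f) / stride) + 1."""
    return math.floor((n + 2 * pad - f) / stride) + 1

f = 3               # assumed pooling window size (not stated in this thread)
pad = (f - 1) // 2  # "same" padding for an odd window at stride 1
out = pool_output_size(28, f, stride=1, pad=pad)
print(out)  # 28
```

Note that this formula only covers height and width; it says nothing about the number of channels.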
Hi, in the example we have an input layer of 28X28X192. It's true that we preserve the height and width by applying zero padding with stride=1, but how does this affect the number of channels? They will remain the same, since max pooling is applied to each of the 192 input channels, so we get a 28X28X192 max-pooled layer and not a 28X28X32 layer.
The dimensions of the max pool output are 28X28X32 because we are using 32 filters, and the number of channels in the output is equal to the number of filters/kernels we use.
There is a Clarifications reading section before the video. This clarification should be added there, noting that later in the video(s) it is mentioned that a 1x1 CONV is used to reduce the number of channels to 32 after applying MAX-POOL.
Hi,
I have the same issue as @Abidhasan, so I’ll try to clarify: the problem is not with the width or height. The problem is with the depth!
As far as I've learned, max pooling doesn't change the depth.
It doesn't work 'volumetrically' like convolution filters do. It works on each depth channel separately. Therefore, if the input has depth=192, then the output has depth=192.
So the output depth can't be the number of filters, as it is with convolution filters.
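To illustrate the per-channel behaviour described above, here is a minimal NumPy sketch of max pooling with same padding and a stride of 1. The function name `max_pool_same`, the 3x3 window, and the random input are assumptions for illustration, not something taken from the lecture.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool_same(x, f=3):
    """Max pooling with stride 1 and 'same' padding on an (H, W, C) array."""
    p = (f - 1) // 2
    # Pad only height and width, using -inf so padding never wins the max.
    xp = np.pad(x, ((p, p), (p, p), (0, 0)), constant_values=-np.inf)
    # Slide windows over the two spatial axes only: shape (H, W, C, f, f).
    windows = sliding_window_view(xp, (f, f), axis=(0, 1))
    # Take the max inside each window; every channel is handled independently.
    return windows.max(axis=(-2, -1))

x = np.random.default_rng(0).standard_normal((28, 28, 192))  # stand-in input
y = max_pool_same(x)
print(y.shape)  # (28, 28, 192)
```

The channel axis is never mixed or reduced, which is why the depth stays at 192.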
The answer is in that reply from @Csaba_Aszalos that you quoted:
In other words, you’re exactly right about how max pooling layers work, but it’s not a pure max pooling layer there: it’s a max pooling layer followed by a 1 x 1 convolution to reduce the number of channels.
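A minimal NumPy sketch of that second step, assuming the pooled volume is already 28 x 28 x 192: a 1 x 1 convolution with 32 filters is a per-pixel linear map across channels. The random arrays `pooled`, `W`, and `b` are placeholders, not real network values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the max-pooled volume (height, width, channels).
pooled = rng.standard_normal((28, 28, 192))

# A 1x1 convolution with 32 filters: one weight vector of length 192 per filter.
W = rng.standard_normal((192, 32))  # placeholder weights
b = np.zeros(32)                    # placeholder biases

# Apply the same 192 -> 32 linear map at every spatial position.
out = pooled @ W + b
print(out.shape)  # (28, 28, 32)
```

The height and width are untouched; only the channel count changes, from 192 to the number of 1 x 1 filters.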