Question - Trying to understand how the convolution operation works w.r.t. the input and output feature volumes

Current understanding - when we say 10 * 10 * 16 w.r.t. feature maps or kernels, 10 * 10 is the spatial dimension (height * width) and 16 is the number of such 10 * 10 feature maps or kernels.
Correct me if I am wrong in my understanding.

Now my doubt:
For example, if I have a 10 * 10 * 6 input feature volume and I want a 10 * 10 * 16 output feature volume (padding=same), and I am using kernels with 3 * 3 spatial dimensions, my questions are below.

  1. What will be the kernel volume required? I am guessing it's 3 * 3 * 16, i.e. 16 different kernels of 3 * 3 spatial dimensions, but I am not sure.

  2. If it's 3 * 3 * 16, how does the convolution operation take place? That is, how do we get 16 output feature maps by convolving 6 input feature maps (each 10 * 10) with 16 kernels (each 3 * 3)?
    Theory 1 - Does one 3 * 3 * 1 kernel convolve over all 10 * 10 * 6 input channels to give one 10 * 10 * 1 output channel, then the next 3 * 3 * 1 kernel convolve over the same 10 * 10 * 6 input channels to give another 10 * 10 * 1 output channel, and we stack them to get 10 * 10 * 2 output channels after using 2 kernels? So after using all 16 kernels we get 16 output channels. Is that the case?

  3. If Theory 1 is correct, how does the convolution operation take place?
    For the sake of simplicity, say we reduce the input to 10 * 10 * 2 (channel 1 - Red, channel 2 - Green),
    and we reduce the filter to just one 3 * 3 * 1.

Now, I know that when I put the 3 * 3 * 1 filter on the Red channel's top-left corner and perform the convolution, I get a single value as output. The same happens when I apply the same kernel to the top-left corner of the Green channel. But now I have 2 values, one from each channel, and I know the output should be 1 value. How do we get from these 2 values to 1 output value? Are we using some aggregation like avg/sum/max?

Some reference docs/links would be appreciated if you could share them.
Thanks.

Please pick the correct week-x tag for your question.
Do look at the 1st assignment for week 1, where you implement the forward pass for the conv2d and pool layers. That should clear up the doubts you have on this topic.

If the inputs to a given layer are 10 x 10 x 6, that means you have 6 “channels” in the input. So at the current layer each filter must match the number of channels on the inputs. So if you choose f = 3, then each filter at the current layer will be 3 x 3 x 6. And if you want to have 16 output channels from the current layer, then you need 16 of those 3 x 3 x 6 filters. Then you compute the output shape in the h and w dimensions by using the formula Prof Ng gave us in the lectures:

n_{out} = \displaystyle \lfloor \frac {n_{in} + 2p - f}{s} \rfloor + 1

Applying that with s = 1 and p = 0 we get:

n_{out} = \displaystyle \lfloor \frac {10 + 2 * 0 - 3}{1} \rfloor + 1 = 8

In that case, the output of the current layer would be 8 x 8 x 16.
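If it helps, the formula can be checked with a few lines of Python. This is just a sketch; the function name `conv_output_size` and its argument names are made up for illustration and mirror the symbols in the formula.

```python
def conv_output_size(n_in: int, f: int, p: int = 0, s: int = 1) -> int:
    """Output height (or width): floor((n_in + 2p - f) / s) + 1."""
    # Integer floor division implements the floor in the formula.
    return (n_in + 2 * p - f) // s + 1


# 10 x 10 input, 3 x 3 filter, no padding, stride 1
print(conv_output_size(10, f=3, p=0, s=1))  # 8
```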

Or if you wanted “same” padding, try p = 1 and you get n_{out} = 10.
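To make the shapes concrete, here is a minimal NumPy sketch of the forward pass described above. It assumes stride 1 by default, leaves out the bias and activation, and uses the usual deep learning convention (cross-correlation, no kernel flip). The function name `naive_conv2d` and the `(f, f, C_in, C_out)` filter layout are illustrative choices, not the assignment's code. It also shows how the per-channel results are combined: the elementwise products are summed over all input channels, so each filter produces one value per output position.

```python
import numpy as np


def naive_conv2d(x, filters, pad=0, stride=1):
    """Naive convolution over a single input volume (no bias, no activation).

    x:       (H, W, C_in) input volume
    filters: (f, f, C_in, C_out), i.e. C_out filters of shape f x f x C_in
    returns: (H_out, W_out, C_out)
    """
    H, W, C_in = x.shape
    f, _, C_f, C_out = filters.shape
    assert C_f == C_in, "each filter must have as many channels as the input"

    # Zero-pad height and width only, never the channel axis.
    x_pad = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    H_out = (H + 2 * pad - f) // stride + 1
    W_out = (W + 2 * pad - f) // stride + 1

    out = np.zeros((H_out, W_out, C_out))
    for i in range(H_out):
        for j in range(W_out):
            r, c = i * stride, j * stride
            patch = x_pad[r:r + f, c:c + f, :]  # shape (f, f, C_in)
            for k in range(C_out):
                # Elementwise product, then a sum over height, width
                # AND all input channels: one number per filter.
                out[i, j, k] = np.sum(patch * filters[:, :, :, k])
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((10, 10, 6))           # 10 x 10 x 6 input
filters = rng.standard_normal((3, 3, 6, 16))   # 16 filters, each 3 x 3 x 6

print(naive_conv2d(x, filters, pad=0).shape)  # (8, 8, 16)
print(naive_conv2d(x, filters, pad=1).shape)  # (10, 10, 16)
```

So in the two-channel Red/Green example, the filter would be 3 x 3 x 2, and the two per-channel results at each position are simply added together.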

Thank you all for the help. I also tried to work out the answer with some quick code; below are the results: