Current understanding: when we say 10 * 10 * 16 with respect to feature maps or kernels, 10 * 10 is the spatial dimension (height * width) and 16 is the number of such 10 * 10 feature maps or kernels.
Please correct me if my understanding is wrong.
Now my doubt:
For example, suppose I have a 10 * 10 * 6 input feature volume and I want a 10 * 10 * 16 output feature volume (padding=same), using kernels with 3 * 3 spatial dimensions. My questions are below:
- What kernel volume is required? I am guessing it is 3 * 3 * 16, i.e. 16 different kernels with 3 * 3 spatial dimensions, but I am not sure.
- If it is 3 * 3 * 16, how does the convolution operation take place? That is, how do we get 16 output feature maps by convolving 6 input feature maps (each 10 * 10) with 16 kernels (each 3 * 3)?
Theory 1: does one 3 * 3 * 1 kernel convolve over all 10 * 10 * 6 input channels to give one 10 * 10 * 1 output channel, then the next 3 * 3 * 1 kernel convolve over the same 10 * 10 * 6 input channels to give another 10 * 10 * 1 output channel, and we stack them to get 10 * 10 * 2 output channels after using 2 kernels? So after using all 16 kernels we get 16 output channels. Is that the case?
- If Theory 1 is correct, how does the convolution operation take place?
For the sake of simplicity, say we reduce the input to 10 * 10 * 2 (channel 1 is Red, channel 2 is Green),
and we reduce the filter to just one 3 * 3 * 1 kernel.
I know that when I place the 3 * 3 * 1 filter on the top-left corner of the Red channel and perform the convolution, I get a single value as output. The same happens when I apply the same kernel to the top-left corner of the Green channel. But now I have 2 values, one from each channel, and I know the output should be 1 value. How do we get from these 2 values to 1 output value? Are we using some aggregation like avg/sum/max?
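To make the two-channel case concrete, here is a minimal NumPy sketch of the situation I am describing. The array names and random values are made up for illustration, padding is ignored so the kernel sits fully inside the top-left 3 * 3 patch, and the same 3 * 3 kernel is reused on both channels as in my simplified setup. The three aggregations at the end are only candidates. As far as I know, standard convolution layers sum the per-channel results (and normally keep a separate 3 * 3 slice per input channel rather than sharing one), but that is exactly the part I would like confirmed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-channel input (Red, Green), each 10 x 10.
x = rng.standard_normal((2, 10, 10))
# A single 3 x 3 kernel, reused on both channels as in my simplified setup.
k = rng.standard_normal((3, 3))


def top_left_response(channel, kernel):
    """Multiply the top-left 3 x 3 patch by the kernel elementwise and sum."""
    patch = channel[:3, :3]
    return float(np.sum(patch * kernel))


red_value = top_left_response(x[0], k)
green_value = top_left_response(x[1], k)

# Candidate ways to turn the two per-channel values into one output value.
candidates = {
    "sum": red_value + green_value,
    "avg": (red_value + green_value) / 2,
    "max": max(red_value, green_value),
}

print("red:", red_value, "green:", green_value)
print(candidates)
```

If summation is the right aggregation, the "sum" entry should match multiplying the whole 2 * 3 * 3 input patch by the kernel and adding everything up in one go.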
Some reference docs/links would be appreciated, if you could share.
Thanks.