Question about channel dimension matching


I would appreciate some help understanding the example of a simple convolutional network presented in the lecture.

In the example given by Prof. Ng:

  1. He wanted to build a cat classifier.

  2. He has an input image of size 39x39x3.

  3. He fed the input image of 39x39x3 into the first convolution layer with:
    f = 3, s = 1, p = 0, and with 10 filters.

My question is:

How can you feed a 39x39x3 RGB image (3 channels) into a layer of ten 3x3 filters?

The image has Nc = 3 channels, but the layer has 10 filters. Those numbers don't match. How did he get this to work?

Thank you!


@Kevin_Shey ,
I am not sure if I got your point, but the number of channels (3) means it's an RGB image. The layer then applies 10 filters to the input image. The number of filters determines the depth of the output tensor from this layer, which is (37, 37, 10).

Also, each filter takes care of all channels of the image, not just one channel. Each filter has a shape of 3 x 3 x 3 (ignoring the bias).

Right, if there are 10 filters, then the shape of the W weight matrix is 3 x 3 x 3 x 10. So you step through the whole convolution 10 times with the ten different 3 x 3 x 3 filters. For each 3 x 3 x 3 filter you do the actual convolution, stepping the filter across the h and w dimensions and convolving it with all 3 input channels at each position. The bias will be 1 x 1 x 1 x 10.
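To make the shapes concrete, here is a minimal NumPy sketch (not from the lecture; the variable names are my own) of a naive forward convolution showing how a 39x39x3 input and ten 3x3x3 filters produce a 37x37x10 output:

```python
import numpy as np

n_in, n_c, f, n_filters = 39, 3, 3, 10
s, p = 1, 0  # stride 1, no padding

A_prev = np.random.randn(n_in, n_in, n_c)   # input image, 39 x 39 x 3
W = np.random.randn(f, f, n_c, n_filters)   # weights, 3 x 3 x 3 x 10
b = np.random.randn(1, 1, 1, n_filters)     # bias, one scalar per filter

n_out = (n_in + 2 * p - f) // s + 1         # (39 + 0 - 3) / 1 + 1 = 37
Z = np.zeros((n_out, n_out, n_filters))

for c in range(n_filters):                  # one pass per filter
    for h in range(n_out):
        for w in range(n_out):
            # each filter covers all 3 input channels at once
            patch = A_prev[h*s:h*s+f, w*s:w*s+f, :]        # 3 x 3 x 3 slice
            Z[h, w, c] = np.sum(patch * W[:, :, :, c]) + b[0, 0, 0, c]

print(Z.shape)  # (37, 37, 10)
```

Note that the elementwise product `patch * W[:, :, :, c]` is over a 3 x 3 x 3 volume, which is why the filter's channel count must equal the input's channel count, while the number of filters is free to be anything.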

You apply this formula to compute the output dimensions:

n_{out} = \displaystyle \lfloor \frac {n_{in} + 2p - f}{s} \rfloor + 1

That applies to both h and w, of course, so with f = 3, s = 1 and p = 0, you get:

n_{out} = \displaystyle \lfloor \frac {39 + 2 * 0 - 3}{1} \rfloor + 1 = 36 + 1 = 37
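The same formula can be written as a small helper (a sketch of my own, not lecture code), which is handy for checking layer sizes:

```python
def conv_output_dim(n_in: int, f: int, p: int, s: int) -> int:
    """floor((n_in + 2p - f) / s) + 1, per dimension (h or w)."""
    return (n_in + 2 * p - f) // s + 1

print(conv_output_dim(39, 3, 0, 1))  # 37
```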

Thank you everyone for the helpful response!