Number of channels not a multiple of the number of channels of the input

Hi,

In AlexNet, the architecture of two following layers is:

  • Layer 6 : 13 x 13 x 384
  • Layer 7 : 13 x 13 x 256

How does it work since 256 is not a multiple of 384? In layer 7, filters are applied to 2/3 of the input channels. Which ones? How is it handled? This should lead to additional hyperparameters, shouldn’t it?

Same question also applies from layer 2 (96 channels) to layer 3 (256 channels) : 256 is not a multiple of 96.

Thanks

It sounds like you are missing a fundamental point about how ConvNets work: each filter at a given layer has the same number of channels as the input to that layer, and the number of filters you have at that layer determines the number of output channels. So there is no required relationship between the input and output channel counts: the number of output channels is purely a decision you make as the system designer, which is why Prof Ng calls it a "hyperparameter".
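As a concrete illustration (hypothetical NumPy shapes only, echoing the layer sizes from the question, not the actual AlexNet weights), the filter bank's leading dimension is the designer's free choice and simply becomes the output depth:

```python
import numpy as np

# Hypothetical shapes: the filter count (256) need not divide
# or be divided by the input depth (384).
n_C_in = 384        # channels coming into the layer
num_filters = 256   # free design choice: becomes the output channel count
f = 3               # filter height/width (also a free choice)

# Each filter spans ALL input channels...
filters = np.zeros((num_filters, f, f, n_C_in))

# ...and each filter contributes exactly one output channel.
n_C_out = filters.shape[0]
assert n_C_out == 256
```

No divisibility is involved anywhere: each filter sees all 384 input channels, and 256 of them produce 256 output channels.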

Thanks for answering. I now understand why I was wrong.
The use of the terms channel and filter is sometimes ambiguous in the videos, with nC being used both for the number of channels and for the number of filters.
Example : in the “Convolutional Neural Networks > Week 1 > Convolution over volume” video at 8’20’’, Prof Ng says :
“So, let’s just summarize the dimensions, if you have a n by n by number of channels input image, so an example, there’s a six by six by three, where n subscript C is the number of channels, and you convolve that with a f by f by, and again, this should be the same nC, so this was, three by three by three, and by convention this and this have to be the same number.
Then, what you get is n minus f plus one by n minus f plus one by and you want to use this nC prime, or it's really nC of the next layer, but this is the number of filters that you use. So this in our example would be four by four by two.”

Actually this is not correct. Prof Ng should not say “the number of filters that you use. So this in our example would be four by four by two.” but rather “the number of channels in the output of the layer. So this in our example would be four by four by six.”

This made me believe that both words could be used interchangeably.

The AlexNet article refers to kernels. It is clearer.

Thanks, and congrats on all those videos; they are awesome content.

I listened to the lecture you point to again. What Prof Ng says is correct; you are just misinterpreting what he is saying. Each “filter” must match the channel dimension of the input. You get to choose the filter size (the f value), but since the input in this case is 6 x 6 x 3, each filter in the current layer must be f x f x 3, with f = 3 in the example he gives. So each filter will give you a 4 x 4 x 1 output. Then if you choose to have a total of 2 filters (each of size 3 x 3 x 3) and apply them, the total volume of the output is 4 x 4 x 2. One channel of the output comes from each filter.

Note that one “step” of a convolution over volume is the elementwise product of the filter with the f x f patch of the input at a given h x w position, across all input channels, with those products then summed to a single scalar value. So with a 3 x 3 x 3 filter, that is 27 products added up to get one output value at each of the 4 x 4 output positions. That’s because the output h and w dimensions are computed from the input by the familiar formula:

n_{out} = \displaystyle \left\lfloor \frac{n_{in} + 2p - f}{s} \right\rfloor + 1

Since Prof Ng is using no padding and stride of 1 in this example we have:

n_{out} = 6 - 3 + 1 = 4
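To make that concrete, here is a minimal NumPy sketch (random data, not the lecture's actual numbers) of a full convolution over volume: a 6 x 6 x 3 input with 2 filters of size 3 x 3 x 3, no padding, stride 1. Each step sums 27 products into one scalar, and the result is 4 x 4 x 2:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 6, 3))            # 6 x 6 x 3 input volume
filters = rng.standard_normal((2, 3, 3, 3))   # 2 filters, each 3 x 3 x 3

n_in, f, s, p = 6, 3, 1, 0
n_out = (n_in + 2 * p - f) // s + 1           # = 6 - 3 + 1 = 4

out = np.zeros((n_out, n_out, filters.shape[0]))
for k in range(filters.shape[0]):             # one output channel per filter
    for i in range(n_out):
        for j in range(n_out):
            # elementwise product over a 3 x 3 x 3 patch: 27 products, summed
            out[i, j, k] = np.sum(x[i:i+f, j:j+f, :] * filters[k])

assert out.shape == (4, 4, 2)
```

The loop is deliberately naive to show the arithmetic; real frameworks vectorize this, but the shapes work out the same way.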

Many thanks, this is quite clear to me now.
Actually, I had a bad recall of the Week 1 video on convolution over volume, forgetting to sum up all the products (shame on me…).

Glad to hear it makes sense now. One other thing to be aware of is that pooling layers work in a different way: they operate “per channel”.
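For example, a rough sketch of 2 x 2 max pooling with stride 2 (hypothetical data): each channel is pooled independently, so the channel count is preserved rather than being set by a filter count:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 4, 2))            # 4 x 4 input with 2 channels

# 2 x 2 max pooling, stride 2, applied per channel: no filters involved,
# so the output keeps the same number of channels as the input.
pooled = np.zeros((2, 2, x.shape[2]))
for c in range(x.shape[2]):
    for i in range(2):
        for j in range(2):
            pooled[i, j, c] = x[2*i:2*i+2, 2*j:2*j+2, c].max()

assert pooled.shape == (2, 2, 2)
```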
