Question on Convolution Kernel Sizes

In assignment one (Week 2) you write:

  • Zero-padding pads the input with a pad of (3,3)
  • Stage 1:
    • The 2D Convolution has 64 filters of shape (7,7) and uses a stride of (2,2).
    • BatchNorm is applied to the ‘channels’ axis of the input.
    • ReLU activation is applied.
    • MaxPooling uses a (3,3) window and a (2,2) stride.
  • Stage 2:
    • The convolutional block uses three sets of filters of size [64,64,256], “f” is 3, and “s” is 1.
    • The 2 identity blocks use three sets of filters of size [64,64,256], and “f” is 3.
  • Stage 3:
    • The convolutional block uses three sets of filters of size [128,128,512], “f” is 3 and “s” is 2.
    • The 3 identity blocks use three sets of filters of size [128,128,512] and “f” is 3.
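For reference, my understanding of Stage 1 in code is roughly the following sketch (I'm assuming the standard Keras layers and the (64, 64, 3) input the assignment uses, so layer names and initializers are omitted):

```python
from tensorflow.keras.layers import (Input, ZeroPadding2D, Conv2D,
                                     BatchNormalization, Activation, MaxPooling2D)

X_input = Input(shape=(64, 64, 3))           # input shape from the assignment
X = ZeroPadding2D((3, 3))(X_input)           # zero-padding with a pad of (3,3)
X = Conv2D(64, (7, 7), strides=(2, 2))(X)    # Stage 1: 64 filters of shape (7,7), stride (2,2)
X = BatchNormalization(axis=3)(X)            # BatchNorm on the channels axis
X = Activation('relu')(X)                    # ReLU activation
X = MaxPooling2D((3, 3), strides=(2, 2))(X)  # (3,3) window, (2,2) stride
# X now has shape (None, 15, 15, 64), the first shape in my list below
```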

I passed the coding fine, and I understand the network architecture. What I don’t understand is why in Stage 2 and Stage 3 the filters are [64, 64, 256] and then double. Isn’t the size (64, 64) too big for the actual size of the network at that point?

The actual layer shapes when we look at the model are:

(None, 15, 15, 64)
(None, 15, 15, 64)
(None, 15, 15, 64)
(None, 15, 15, 256)
(None, 15, 15, 256)
(None, 15, 15, 256)
(None, 8, 8, 128)
(None, 8, 8, 128)
(None, 8, 8, 512)
(None, 8, 8, 512)
(None, 8, 8, 512)
(None, 4, 4, 256)
(None, 4, 4, 256)
(None, 4, 4, 1024)
(None, 4, 4, 1024)
(None, 4, 4, 1024)

So the last number follows the number of channels that we set up, but the spatial size is decreasing. Yet we are using 64, then 128, then 256.

Why do we specify the convolutional layers to have larger and larger sizes (64, 64), then (128, 128)?

Thanks in advance,

I think the issue is just that you are misinterpreting the meaning of the parameter that is specified as [64, 64, 256]. Those are not the sizes of the filters in the sense of f: they are the numbers of output channels for three different layers. So in other words, in Stage 2 you invoke the convolutional_block function with [64, 64, 256] and f = 3, and you end up creating three separate convolutional layers:

64 filters of shape (1, 1)
64 filters of shape (3, 3)
256 filters of shape (1, 1)

Note that only the middle layer uses the (f, f) window; the first and third are (1, 1) “bottleneck” convolutions.

The actual output size after each of those layers is something you have to compute by taking the stride, the padding, and the input size into account. That’s what you see in the actual layer shapes that you show later.
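As a reminder, the spatial output size of a conv layer is floor((n + 2p − f) / s) + 1, where n is the input size, f the filter size, p the padding, and s the stride. Here is a quick check of the Stage 2 → Stage 3 transition in your list, assuming (as in the notebook) that the convolutional block applies its stride s in the first (1, 1) layer; conv_output_size is just my name for the helper:

```python
from math import floor

def conv_output_size(n, f, p, s):
    """Spatial output size of a conv layer: floor((n + 2p - f) / s) + 1."""
    return floor((n + 2 * p - f) / s) + 1

# Stage 2 -> Stage 3: a 15 x 15 input through a (1,1) conv, no padding, stride 2
print(conv_output_size(n=15, f=1, p=0, s=2))  # -> 8, matching the (None, 8, 8, ...) shapes
```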

Take a look at what the logic in convolutional_block actually does with that input.
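If it helps, here is a minimal sketch of that logic, assuming the bottleneck structure the notebook uses (convolutional_block_sketch is my own name, and I’ve left out the layer names, initializers, and the training flag, so don’t read it as the graded implementation):

```python
from tensorflow.keras.layers import Conv2D, BatchNormalization, Activation, Add

def convolutional_block_sketch(X, f, filters, s=2):
    """Main path plus shortcut, stripped down to the layer structure."""
    F1, F2, F3 = filters        # e.g. [64, 64, 256] in Stage 2
    X_shortcut = X

    # First component: F1 filters of shape (1,1), stride (s,s)
    X = Conv2D(F1, (1, 1), strides=(s, s))(X)
    X = BatchNormalization(axis=3)(X)
    X = Activation('relu')(X)

    # Second component: F2 filters of shape (f,f) -- the only place f is used
    X = Conv2D(F2, (f, f), strides=(1, 1), padding='same')(X)
    X = BatchNormalization(axis=3)(X)
    X = Activation('relu')(X)

    # Third component: F3 filters of shape (1,1)
    X = Conv2D(F3, (1, 1), strides=(1, 1))(X)
    X = BatchNormalization(axis=3)(X)

    # Shortcut path: a (1,1) conv so the channel count matches F3 for the Add
    X_shortcut = Conv2D(F3, (1, 1), strides=(s, s))(X_shortcut)
    X_shortcut = BatchNormalization(axis=3)(X_shortcut)

    X = Add()([X, X_shortcut])
    return Activation('relu')(X)
```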

As to the question of why the number of channels goes up as you proceed through the network, that is the general way that ConvNets work. Prof Ng discusses this at a number of points in the lectures, but you can think of it as the spatial area being reduced and “distilled” down into the detection of more and more features as you proceed through the network.

There is a really interesting lecture in Week 4 titled “What are Deep ConvNets Learning” in which Prof Ng shows and explains some really cool work that gives us a way to visualize what the inner layers of the network are actually detecting. Even if you haven’t gotten to Week 4 yet, I think that lecture would still make sense and is definitely worth a look either now or when you get there or both. :nerd_face:

Thank you, this was really helpful. I understand it now. I sometimes find a gap between the theory and what I understand of the code. I appreciate it!