Output Shape of 2nd Conv2D layer

I had the same question that Iratxe Moya posted in the Coursera discussion forum.
I am pasting it here since it was left unanswered.

Iratxe Moya
16 days ago

Hello, I have a question about the output shape of the convolutions. I understand that if I feed N images of 28x28 pixels each into the model, after the convolution I would have 64 subimages for each of the N images, one for each filter of the convolution. This makes sense when I apply only one convolutional layer, but it is not clear to me what happens when I apply a second convolutional layer after the first one (and after its pooling layer, of course).

Let’s imagine I have the following model:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # 64 filters of size 3x3, starting from some known good filters
    tf.keras.layers.MaxPooling2D(2, 2),  # Take the maximum value from each block of 4 pixels (2x2)
    tf.keras.layers.Conv2D(42, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),  # Added: the Dense layers expect one flat vector per image
    tf.keras.layers.Dense(1024, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)  # 10 corresponds to the number of image classes we have
])

If I execute a model.summary(), I get the following:

Here I can see that after the second convolution, the number of output channels is not 2688 (64x42), as I would have thought at first, but only 42. So how are the filters of the second convolution applied? I thought that maybe each filter of the second convolution is applied to each previously convolved subimage and the results are then averaged, or something similar, but it is not clear to me. Any help?


Hi Sreehari,

You and Iratxe raise a very valid question; it is indeed a bit confusing. I find the article here to be very illuminating on this point.

To highlight the key point: the depth of any convolutional kernel always equals the number of channels in its input. This is not stated explicitly in the model definition, but when the second convolution layer is declared as Conv2D(42, (3, 3), activation='relu'), it creates 42 kernels, each of size 3x3x64, where 64 is inferred from the number of channels coming in from the previous layer. Each kernel multiplies and sums over all 64 input channels at once to produce a single output feature map, so the layer outputs 42 channels rather than 64x42.
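
To make the "sum over all input channels" step concrete, here is a minimal NumPy sketch of a 'valid' convolution with stride 1. It is only an illustration, not how Keras implements Conv2D: the helper name conv2d_valid is made up for this example, the inputs are random stand-ins for the real feature maps, and the bias and ReLU are left out. The kernel array uses the (height, width, input channels, output channels) layout that Keras uses for Conv2D weights.

```python
import numpy as np

def conv2d_valid(x, kernels):
    """Naive 'valid' convolution (no padding, stride 1); hypothetical helper.

    x:       (H, W, C_in) input feature maps
    kernels: (kH, kW, C_in, C_out), one 3-D kernel per output channel
    returns: (H - kH + 1, W - kW + 1, C_out)
    """
    h, w, c_in = x.shape
    kh, kw, k_in, c_out = kernels.shape
    assert c_in == k_in, "kernel depth must match the input channels"
    out = np.zeros((h - kh + 1, w - kw + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + kh, j:j + kw, :]  # shape (kH, kW, C_in)
            # Multiply the patch by every kernel and sum over height,
            # width AND input channels: one number per output channel.
            out[i, j, :] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
pooled = rng.standard_normal((13, 13, 64))     # stand-in for the first pooling output
kernels = rng.standard_normal((3, 3, 64, 42))  # 42 kernels, each 3x3x64
features = conv2d_valid(pooled, kernels)
print(features.shape)  # (11, 11, 42)
```

Because the 64 input channels are collapsed inside each kernel's sum, the number of output channels depends only on the number of kernels (42), not on the product 64x42.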

This is why an input of shape 28x28x1, fed to a Conv2D layer with 64 kernels of size 3x3x1, gets transformed to (26, 26, 64).
MaxPooling then reduces it to 13x13x64.
That is passed to the second Conv2D layer, with 42 kernels of size 3x3x64, to produce an 11x11x42 output.
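
The same arithmetic can be checked by hand with a few lines of plain Python. This is a rough sketch that assumes the Keras defaults for these layers ('valid' padding, stride 1 for the convolutions, stride equal to the pool size for pooling); the helper names conv_out, pool_out and conv_params are made up for this example.

```python
def conv_out(size, kernel):
    # 'valid' padding, stride 1: the kernel must fit entirely inside the input
    return size - kernel + 1

def pool_out(size, pool):
    # non-overlapping pooling windows; any leftover border is dropped
    return size // pool

def conv_params(kernel, c_in, c_out):
    # each of the c_out kernels has kernel*kernel*c_in weights plus one bias
    return c_out * (kernel * kernel * c_in) + c_out

size, channels = 28, 1
shapes = []

size = conv_out(size, 3); p1 = conv_params(3, channels, 64); channels = 64
shapes.append((size, size, channels))   # after the first Conv2D
size = pool_out(size, 2)
shapes.append((size, size, channels))   # after the first MaxPooling2D
size = conv_out(size, 3); p2 = conv_params(3, channels, 42); channels = 42
shapes.append((size, size, channels))   # after the second Conv2D
size = pool_out(size, 2)
shapes.append((size, size, channels))   # after the second MaxPooling2D

print(shapes)  # [(26, 26, 64), (13, 13, 64), (11, 11, 42), (5, 5, 42)]
print(p1, p2)  # 640 24234
```

The parameter count of the second convolution, 42 x (3 x 3 x 64) + 42, is another sign that each of its kernels spans all 64 input channels.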

Let me know if this helps.