One thing that is unclear is how Conv2D layers handle inputs with more than one channel. That is, what if the shape of an input image were (28, 28, n) rather than (28, 28, 1)?

In the example from the class, the second convolutional layer receives its input from the first convolutional layer followed by MaxPool2D over (2, 2) blocks. If the first Conv2D layer has 32 kernels, this input will have shape (13, 13, 32), so in effect it is like a tiny image with 32 channels. The second Conv2D layer, which also has 32 kernels, has an output shape of (13, 13, 32) again rather than (13, 13, 1024); it is as if the 32 channels of the input are in effect reduced to one. I gather from digging around that Conv2D will simply convolve each of the (32, here) input channels with the same kernel, apply the activation function to each, and output the sum.

QUESTION 1: Is this accurate? If so, then I have a complaint: it seems that it would defeat the purpose of having many kernels if the next layer effectively just sees an average of the outputs of all those kernels. Another way of saying this: suppose your image is RGB. Then it would not be possible for kernels to learn to pick up features that are specific to one of those three colors.

QUESTION 2: How can you stack convolutional layers so that each kernel is applied separately to the output of each kernel in the previous layer? For example, if the input had one channel, the first convolutional layer had 4 kernels, and the second convolutional layer also had 4 kernels, I would expect the output of the second layer to have 4 * 4 = 16 channels. With the actual implementation shown, the output has 4 channels again. To paraphrase: what I'm asking for is a way to have a convolutional layer produce separate outputs for each of the input channels, with each kernel shared between those channels.
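From what I can piece together, a single Conv2D filter appears to compute the following. This is a plain numpy sketch of my understanding (my own illustration, not the class code), assuming 'valid' padding and stride 1: each filter carries a separate (kh, kw) kernel slice per input channel, and the per-channel results are combined into one map before the activation is applied.

```python
import numpy as np

def conv2d_single_filter(x, w, b):
    """One Conv2D filter over a multi-channel input ('valid' padding, stride 1).

    x: (H, W, C) input; w: (kh, kw, C) kernel, a separate (kh, kw) slice per channel;
    b: scalar bias. Returns a single (H - kh + 1, W - kw + 1) feature map.
    """
    H, W, C = x.shape
    kh, kw, _ = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # the per-channel products collapse into ONE number per position
            out[i, j] = np.sum(x[i:i + kh, j:j + kw, :] * w) + b
    # the activation (ReLU here) is applied once, after the cross-channel combination
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((28, 28, 3))  # an RGB-like input
w = rng.standard_normal((3, 3, 3))    # one filter: its own 3x3 kernel per channel
out = conv2d_single_filter(x, w, 0.0)
print(out.shape)  # (26, 26)
```

Note that each channel gets its own kernel slice here, so a filter on an RGB input has 3x3x3 weights, not 3x3.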
Hello @MatthewTCushman!
I believe taking DLS Course 4 (Convolutional Neural Networks) will answer all your doubts. In that course, Prof. Andrew explains CNNs from scratch, with 3D illustrations.
The TensorFlow course only teaches TF, not the theory behind CNNs, so I think DLS Course 4 is a good place to start.
Best,
Saif.
Wanted to add: I didn't realize that all the separate kernels for each channel are in fact present; see the weights for each of the three Conv2D layers at the bottom, as retrieved by get_weights(). There the first two dimensions are the kernel size (3x3), the next dimension is the number of input channels (I didn't realize this wasn't fixed at 1), and the last is the number of kernels. My question is still relevant, because the actual outputs from the layer do not have this extra dimension corresponding to the number of input channels; I'm still assuming some averaging takes place. Also, I'm now aware there may be a parameter for the number of output channels that perhaps defaults to 1?
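To make the shape bookkeeping concrete, here is a numpy sketch (my own, again assuming 'valid' padding and stride 1) of a whole Conv2D layer driven by a (kh, kw, C_in, F) weight tensor like the ones printed at the bottom. The C_in axis is consumed inside each filter, which is why the layer's output has F channels and no separate input-channel axis.

```python
import numpy as np

def conv2d_layer(x, w, b):
    """Whole Conv2D layer ('valid' padding, stride 1).

    x: (H, W, C_in) input; w: (kh, kw, C_in, F) weights; b: (F,) biases.
    Each of the F filters has its OWN (kh, kw) kernel per input channel;
    the C_in per-channel results are combined into one map per filter, so the
    output is (H - kh + 1, W - kw + 1, F), with no separate C_in dimension.
    """
    H, W, C = x.shape
    kh, kw, c_in, F = w.shape
    assert C == c_in, "weight tensor's third axis must match the input channels"
    out = np.zeros((H - kh + 1, W - kw + 1, F))
    for f in range(F):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, f] = np.sum(x[i:i + kh, j:j + kw, :] * w[:, :, :, f]) + b[f]
    return out

# toy inputs with the shapes from my earlier example
x = np.ones((13, 13, 32))
w = np.ones((3, 3, 32, 32)) * 0.01
b = np.zeros(32)
out = conv2d_layer(x, w, b)
print(out.shape)  # (11, 11, 32): one channel per filter, C_in folded in
```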
model.summary()
output:
Model: "sequential"
Layer (type) Output Shape Param #
conv2d (Conv2D) (None, 148, 148, 16) 448
max_pooling2d (MaxPooling2D) (None, 74, 74, 16) 0
conv2d_1 (Conv2D) (None, 72, 72, 32) 4640
max_pooling2d_1 (MaxPooling2D) (None, 36, 36, 32) 0
conv2d_2 (Conv2D) (None, 34, 34, 64) 18496
max_pooling2d_2 (MaxPooling2D) (None, 17, 17, 64) 0
flatten (Flatten) (None, 18496) 0
dense (Dense) (None, 512) 9470464
dense_1 (Dense) (None, 1) 513
=================================================================
Total params: 9,494,561
Trainable params: 9,494,561
Non-trainable params: 0
print(model.layers[0].get_weights()[0].shape)
print(model.layers[2].get_weights()[0].shape)
print(model.layers[4].get_weights()[0].shape)
output:
(3, 3, 3, 16)
(3, 3, 16, 32)
(3, 3, 32, 64)
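These weight shapes also line up with the Param # column in model.summary(): each Conv2D layer has kh * kw * C_in * F weights plus F biases. A quick arithmetic check (my own helper, not a Keras function):

```python
def conv2d_params(kh, kw, c_in, filters):
    # one (kh, kw, c_in, filters) weight tensor plus one bias per filter
    return kh * kw * c_in * filters + filters

print(conv2d_params(3, 3, 3, 16))   # 448, matching conv2d
print(conv2d_params(3, 3, 16, 32))  # 4640, matching conv2d_1
print(conv2d_params(3, 3, 32, 64))  # 18496, matching conv2d_2
```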