Understanding of Conv2D

I don’t understand the first parameter in the Conv2D function.
I’ve found an answer on Stack Overflow (machine learning - What is the number of filter in CNN? - Stack Overflow), but the thing I don’t understand is this: if we have 64 filters, why do we have 1 image in the output? Shouldn’t it return 64 images, one per filter? Or is there some logic that picks, out of those 64, the 1 image that is most accurate and most useful for the learning process?

It sounds like the problem with the TF courses is that they assume you already understand the definitions of the various kinds of networks and how they work, and then they just show you how to build them in TF. I would suggest that you might want to take DLS C1, C2 and C4 before you proceed here. In particular, DLS C4 explains ConvNets.

Or go find some videos on YouTube that explain how ConvNets work. Here’s a quick sketch of how it works:

Suppose your inputs are RGB images of size 256 x 256 pixels. That means you also have 3 input color channels, so each input is a 3D tensor with shape 256 x 256 x 3. Now suppose you want to apply a Convolutional filter to that image. Let’s suppose we use a 5 x 5 filter with a stride of 1 and “valid” padding, meaning no padding. That means each filter is a 3D tensor of shape 5 x 5 x 3, because the channels of the filter need to match the channels of the input. Then you “step” that 5 x 5 x 3 filter across and down the image and you get an output size that is determined by this formula:

n_{out} = \displaystyle \lfloor \frac {n_{in} + 2p - f}{s} \rfloor + 1

That applies to both h and w, of course, so you get:

n_{out} = \displaystyle \lfloor \frac {256 + 2 * 0 - 5}{1} \rfloor + 1 = 251 + 1 = 252
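If it helps to see that formula as code, here’s a tiny sketch (the function name `conv_out_size` is just my own label, not a TF API):

```python
import math

def conv_out_size(n_in, f, p=0, s=1):
    """Output spatial size of a conv layer: floor((n_in + 2p - f) / s) + 1."""
    return math.floor((n_in + 2 * p - f) / s) + 1

# The worked example from above: 256 x 256 input, 5 x 5 filter,
# no padding ("valid"), stride 1.
print(conv_out_size(256, f=5, p=0, s=1))  # 252
```

The same function also covers padded or strided cases, e.g. `conv_out_size(256, f=5, p=2, s=2)` for a padded, strided layer.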

So the output created by each individual filter will be 252 x 252 x 1. Then if you have 64 separate filters, the final result for each input image will be 252 x 252 x 64. You “stack” the outputs of the individual filters to form the full output. Of course there were a bunch of arbitrary choices we made there: the size of the filters, the stride, the padding and the number of total filters.
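To make the “stacking” concrete, here’s a deliberately naive NumPy sketch of one conv layer. I’ve shrunk the sizes (8 x 8 input, 4 filters instead of 256 x 256 and 64) so the plain Python loops stay fast, but the shape logic is exactly the one described above:

```python
import numpy as np

# Toy sizes standing in for the 256 x 256 x 3 input and 64 filters in the text.
H = W = 8          # input height/width
C = 3              # input channels (RGB)
F = 5              # filter size (5 x 5)
N_FILTERS = 4      # number of filters

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))
# Each filter matches the input's channel count: shape F x F x C.
filters = rng.standard_normal((N_FILTERS, F, F, C))

out_h = out_w = H - F + 1   # "valid" padding, stride 1

# Each filter produces ONE 2D feature map; stacking all of them along the
# last axis gives the out_h x out_w x N_FILTERS output volume.
maps = []
for k in range(N_FILTERS):
    fmap = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Elementwise product of the F x F x C window with the filter,
            # summed to a single number -- one pixel of one feature map.
            fmap[i, j] = np.sum(image[i:i + F, j:j + F, :] * filters[k])
    maps.append(fmap)

output = np.stack(maps, axis=-1)
print(output.shape)  # (4, 4, 4) -- analogous to (252, 252, 64) in the text
```

So no single “best” image is picked: all 64 feature maps are kept, stacked as channels, and the next layer consumes that whole volume.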

The other high level point here is that you just initialize all those filters randomly for symmetry breaking and then they learn whatever they need to learn through back propagation. Because they all start out different, they will also (with high probability) learn different things. Of course this is just one “conv” layer. You then need to compose an entire network model, which will probably involve several conv layers with pooling layers and perhaps some fully connected layers at the end, depending on what your goals are.
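A quick way to see why the random initialization matters: if two filters started out identical, they would produce identical feature maps and hence receive identical gradients, so they could never diverge and learn different things. A small sketch of that (reusing the naive convolution idea from above; `feature_map` is just an illustrative helper, not a TF function):

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.standard_normal((8, 8, 3))

def feature_map(img, filt):
    """Naive valid-padding, stride-1 convolution of one filter."""
    f = filt.shape[0]
    out = img.shape[0] - f + 1
    return np.array([[np.sum(img[i:i + f, j:j + f, :] * filt)
                      for j in range(out)] for i in range(out)])

# Two IDENTICAL filters give identical outputs (and would get identical
# gradients in backprop -- the symmetry is never broken).
same = np.ones((5, 5, 3))
print(np.allclose(feature_map(image, same), feature_map(image, same)))  # True

# Randomly initialized filters differ from the very first step, so
# backpropagation can push them toward learning different features.
f1, f2 = rng.standard_normal((2, 5, 5, 3))
print(np.allclose(feature_map(image, f1), feature_map(image, f2)))  # False
```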