- Input 32x32x3 => 32x32 is width x height and 3 is the RGB channels
- f=5 => the filter is 5x5
- 28x28x6 => (32-4) x (32-4), but I have no idea where the 6 comes from.

The way Conv layers work is that you get to choose how many “filters” you define in a given layer. Each filter has the same number of channels as the input to that layer (3 in this case). The number of output channels is determined by how many filters you choose to have, which is 6 in this case. If the size of the filter is f = 5, then each filter will be 5 x 5 x 3, because the input has 3 channels, right? Each filter then produces a 28 x 28 x 1 output, and the 6 of those are stacked together to give 28 x 28 x 6.
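Here is a minimal sketch of that shape bookkeeping in plain Python, in case it helps to see it run. The function name `conv2d_valid` and the random values are just assumptions for illustration; this is not how TF or any real framework implements it, only a naive loop that shows where the 6 comes from.

```python
import random

def conv2d_valid(image, filters):
    """Naive 'valid' convolution (no padding, stride 1).

    image:   nested lists shaped [height][width][channels]
    filters: nested lists shaped [num_filters][f][f][channels]
    returns: nested lists shaped [out_h][out_w][num_filters]

    Like most deep learning frameworks, this slides the filter
    without flipping it (technically a cross-correlation).
    """
    h, w, c = len(image), len(image[0]), len(image[0][0])
    f = len(filters[0])
    out_h, out_w = h - f + 1, w - f + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            pixel = []
            # One output channel per filter: this is where the 6 comes from.
            for filt in filters:
                total = 0.0
                for di in range(f):
                    for dj in range(f):
                        for k in range(c):
                            total += image[i + di][j + dj][k] * filt[di][dj][k]
                pixel.append(total)
            row.append(pixel)
        output.append(row)
    return output

random.seed(0)
# Hypothetical 32 x 32 x 3 input and 6 filters of shape 5 x 5 x 3.
image = [[[random.random() for _ in range(3)] for _ in range(32)] for _ in range(32)]
filters = [[[[random.random() for _ in range(3)] for _ in range(5)] for _ in range(5)]
           for _ in range(6)]

out = conv2d_valid(image, filters)
output_shape = (len(out), len(out[0]), len(out[0][0]))
print(output_shape)
```

If you change the number of filters from 6 to any other value, only the last dimension of the output changes; the 28 x 28 part depends only on the input size and f.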

So 6 is a “hyperparameter” meaning a value that the system designer (Yann LeCun in this case) chose and then experimentally verified as a good choice.

BTW I have not taken the TF specialization, so I don’t know how much they explain about ConvNets. Perhaps they are assuming you’ve already taken DLS Course 4, which explains in detail how ConvNets work.

Also note that the formula for determining the size of the h and w dimensions is:

$$n_{out} = \left\lfloor \frac{n_{in} + 2p - f}{s} \right\rfloor + 1$$

In this case it is using “valid” padding (p = 0) and stride of 1, so that gives:

$$n_{out} = \left\lfloor \frac{32 + 2 \cdot 0 - 5}{1} \right\rfloor + 1 = 32 - 5 + 1 = 28$$
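If you want to play with the formula, here is a small sketch in Python. The helper name `conv_output_size` is my own, not something from a library; integer floor division stands in for the floor brackets.

```python
def conv_output_size(n_in, f, p=0, s=1):
    """Output height or width of a conv layer.

    n_in: input height or width
    f:    filter size
    p:    padding on each side (0 for "valid" padding)
    s:    stride
    """
    # Floor division implements the floor in the formula.
    return (n_in + 2 * p - f) // s + 1

# The case from this thread: 32 x 32 input, f = 5, "valid" padding, stride 1.
print(conv_output_size(32, 5))        # 28
# With p = 2 the output keeps the input size for f = 5 and stride 1.
print(conv_output_size(32, 5, p=2))   # 32
# With stride 2 the floor matters: (32 - 5) / 2 = 13.5 rounds down to 13.
print(conv_output_size(32, 5, s=2))   # 14
```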

Thank you very much. Your explanation gave me a chance to understand it from another perspective.