The output of the conv layer has 16 channels. The inputs are 6 x 6 (n_{in} = 6), and the convolution uses stride s = 1, padding p = 0 and filter size f = 3. Here’s the formula for the height and width of the output:
n_{out} = \displaystyle \lfloor \frac {n_{in} + 2p - f}{s} \rfloor + 1
So we have:
n_{out} = \displaystyle \lfloor \frac {6 + 2 \cdot 0 - 3}{1} \rfloor + 1 = 3 + 1 = 4
So each of the 16 output channels is 4 x 4.
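As a quick check, the formula can be sketched in a few lines of Python. The function name `conv_output_size` and the height, width, channels ordering of the shape tuple are assumptions made for illustration; they do not come from any particular library.

```python
def conv_output_size(n_in: int, f: int, p: int = 0, s: int = 1) -> int:
    """Output size along one spatial dimension of a convolution.

    Hypothetical helper: n_in is the input size, f the filter size,
    p the padding and s the stride, matching the symbols in the formula.
    """
    # Integer floor division plays the role of the floor brackets.
    return (n_in + 2 * p - f) // s + 1


# Values from the example: 6 x 6 input, 3 x 3 filter, no padding, stride 1.
n_out = conv_output_size(n_in=6, f=3, p=0, s=1)

# 16 output channels; the (height, width, channels) ordering is an assumption.
output_shape = (n_out, n_out, 16)
print(output_shape)  # (4, 4, 16)
```

With a stride larger than 1 the floor starts to matter: for example, a 7 x 7 input with f = 3, p = 0 and s = 2 gives (7 - 3) // 2 + 1 = 3.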