Calculating the output shape layer by layer:
First Convolution and MaxPooling
The input is an RGB image of 224 x 224 x 3
Conv2d layer 1 - Using 32 filters with a 5×5 kernel (stride 1, no padding), the output spatial dimension is (224 - 5) + 1 = 220.
Output = 220 x 220 x 32 feature map
MaxPool2d layer 1 - With a 2 x 2 kernel and stride 2, the spatial dimensions are halved
Output 220 / 2 = 110
Output shape is 110 x 110 x 32 feature map.
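The arithmetic above follows the usual output-size rule for convolution and pooling layers, floor((size + 2*padding - kernel) / stride) + 1. A minimal sketch of that rule in plain Python is below; the helper name `conv_output_size` is made up for illustration and is not part of any library.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer along one dimension."""
    # floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

# First stage from the walkthrough: 5x5 conv (stride 1), then 2x2 max pool (stride 2)
first_conv = conv_output_size(224, kernel=5)
first_pool = conv_output_size(first_conv, kernel=2, stride=2)
print(first_conv, first_pool)  # 220 110
```

The same helper works for the later stages by changing the input size and kernel.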
Second Convolution and MaxPooling
Conv2d layer 2 - Using 64 filters with a 5×5 kernel (stride 1, no padding), the output spatial dimension is (110 - 5) + 1 = 106
Output = 106 x 106 x 64 feature map
MaxPool2d layer 2 - With a 2 x 2 kernel and stride 2, the spatial dimensions are halved
Output 106 / 2 = 53
Output shape is 53 x 53 x 64 feature map.
Third Convolution and MaxPooling
Conv2d layer 3 - Using 128 filters with a 3x3 kernel (stride 1, no padding), the output spatial dimension is (53 - 3) + 1 = 51
Output = 51 x 51 x 128 feature map
MaxPool2d layer 3 - With a 2 x 2 kernel and stride 2, the spatial dimensions are halved and rounded down, because 51 is odd
Output floor(51 / 2) = 25
Output shape is 25 x 25 x 128 feature map.
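The third pooling step is the only one where the input size is odd, so the result is rounded down rather than halved exactly. The sketch below assumes floor rounding (PyTorch's `MaxPool2d` behaves this way with its default `ceil_mode=False`); `pool_output_size` is a hypothetical helper name.

```python
def pool_output_size(size, kernel=2, stride=2):
    """Output size of a pooling layer along one dimension, rounding down."""
    return (size - kernel) // stride + 1

third_pool = pool_output_size(51)

# Number of input rows/columns actually covered by the pooling windows
covered = (third_pool - 1) * 2 + 2
dropped = 51 - covered

print(third_pool, covered, dropped)  # 25 50 1
```

In other words, the last row and column of the 51 x 51 map are ignored by the pooling windows, which is why the result is 25 and not 26.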
So the input dimension for the first fully connected layer is the final feature map’s depth, height, and width multiplied together
128 x 25 x 25 = 80,000
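To double-check the whole chain, the shapes can be traced in a few lines of plain Python. This is only a sketch: the layer list is an assumption reconstructed from the shapes above (in particular, it takes the third convolution to have 128 filters, as the stated output depth implies), and `trace_shapes` is a made-up helper.

```python
# (kind, out_channels, kernel, stride); no padding anywhere
layers = [
    ("conv", 32, 5, 1),
    ("pool", None, 2, 2),
    ("conv", 64, 5, 1),
    ("pool", None, 2, 2),
    ("conv", 128, 3, 1),   # assumed 128 filters, matching the stated output depth
    ("pool", None, 2, 2),
]

def trace_shapes(channels, size, layers):
    """Return the (channels, height, width) shape after every layer."""
    shapes = []
    for kind, out_channels, kernel, stride in layers:
        size = (size - kernel) // stride + 1
        if kind == "conv":
            channels = out_channels  # pooling leaves the channel count unchanged
        shapes.append((channels, size, size))
    return shapes

shapes = trace_shapes(3, 224, layers)
for shape in shapes:
    print(shape)

depth, height, width = shapes[-1]
flat_features = depth * height * width
print(flat_features)  # 80000
```

The final value, 80000, is what the first fully connected layer would take as its input feature count after flattening.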