Effect of selecting kernel size

Hi. What is the effect of using different kernel sizes in convolution? If someone uses a smaller kernel size like 3 by 3 in the initial layer instead of a 5 by 5 or a 7 by 7, what difference will it make? What will be the effect on the classification accuracy? What is the effect of using the same kernel size in all the layers versus decreasing the kernel size as we go deeper into the network?

@TMosh Can you please answer my queries?


Kernel size is the “window size” the convolution sees each time it is applied. A bigger kernel covers a larger region of the image at once, so it can take bigger objects into account, but at a higher computational cost, because the number of weights grows with the square of the kernel side.
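To put rough numbers on the cost, here is a minimal sketch in plain Python. The function name `conv2d_cost` and the layer shape (3 input channels, 64 filters, a 224 by 224 output map) are assumptions chosen for illustration, not taken from any particular model.

```python
def conv2d_cost(kernel_size, in_channels, out_channels, out_h, out_w):
    """Return (parameters, multiply-accumulates) for one square conv layer."""
    # Each filter has kernel_size * kernel_size * in_channels weights.
    weights = kernel_size * kernel_size * in_channels * out_channels
    # One bias per output channel.
    params = weights + out_channels
    # Every output position applies every weight once.
    macs = weights * out_h * out_w
    return params, macs


for k in (3, 5, 7):
    params, macs = conv2d_cost(k, in_channels=3, out_channels=64,
                               out_h=224, out_w=224)
    print(f"{k}x{k}: {params} parameters, {macs} multiply-accumulates")
```

With the same channels and output size, going from 3 by 3 to 7 by 7 multiplies the weights and the multiply-accumulates by 49/9, a bit more than 5 times.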

A larger kernel also shrinks the output dimensions more than a smaller one when no padding is used. This is similar to the effect of a stride, except that with a stride of 1 the kernel is still applied at every pixel position.
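The shrinking follows the usual output-size formula, floor((n + 2p - k) / s) + 1. A small sketch, assuming a square input and a hypothetical helper named `conv_output_size`:

```python
def conv_output_size(n, kernel_size, stride=1, padding=0):
    """Output side length for an n x n input: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel_size) // stride + 1


# Same 32 x 32 input, no padding, stride 1: larger kernels shrink it more.
for k in (3, 5, 7):
    print(f"{k}x{k} kernel: {conv_output_size(32, k)}")

# Padding of (k - 1) / 2 keeps the size unchanged at stride 1.
print(conv_output_size(32, 5, padding=2))
```

Without padding, each layer removes k - 1 pixels per side length, so a 3 by 3 kernel gives 30, a 5 by 5 gives 28, and a 7 by 7 gives 26 for a 32 by 32 input.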

In the first layers the features are more “spread out” across the image, while after a few layers they are more condensed along the channel dimension. That’s why kernels are usually smaller towards the end of the model.

Which is the better approach: using the same kernel size in every convolution, like VGG-16, or decreasing the kernel size as we go deeper into the network, like AlexNet? What about “strides”? What if I don’t use any strides, or what will be the consequence of using a stride of 1 or 2?


There is no “correct answer” for all cases. Keep in mind that the VGG family (2014) was released two years after AlexNet (2012), so it is no surprise that it achieved better accuracy on the ImageNet challenge.

That being said, both are quite old-fashioned and have been largely superseded by more modern approaches, like EfficientNet v1 (2019) and v2 (2021). I think both of those are worth reading to get some clues about how to choose kernel sizes.

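Regarding the second question, the main point of strides is dimension reduction. A stride of 1 applies the kernel at every position and keeps the spatial size (given suitable padding), while a stride of 2 roughly halves the height and width. If you want a light and fast model without sacrificing a lot of accuracy, increasing the stride in convolutions can be a good solution.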
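As a rough illustration of the stride trade-off, the sketch below compares stride 1 and stride 2 for a 3 by 3 kernel. The helper name `strided_output_size` and the 224 by 224 input with padding 1 are assumptions picked for the example.

```python
def strided_output_size(n, kernel_size, stride, padding):
    """Output side length for an n x n input: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel_size) // stride + 1


for stride in (1, 2):
    side = strided_output_size(224, kernel_size=3, stride=stride, padding=1)
    # The number of output positions drives the cost of this layer
    # and of every layer that follows it.
    print(f"stride {stride}: {side} x {side} = {side * side} positions")
```

Here stride 2 cuts the number of output positions to a quarter, which is where the speed-up comes from. The price is coarser spatial resolution in the later layers.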