I’ve noticed that the images these ConvNets are trained on are rather small, e.g. YOLO uses 608x608.
I understand that you can downscale images to these sizes, but surely in the age of 100 megapixel phone cameras we will also sometimes want to run a CNN on a much larger image.
If I am understanding ConvNets correctly, the first layer using a 3x3 or 5x5 filter is only looking at 9 or 25 pixels respectively per sliding window. Those few pixels cover a much larger fraction of a 608x608 image than they do of a 100 megapixel image.
Then generally, as the height and width decrease with more layers, the convolutions are creating encodings based on information from larger and larger regions of the original image (the receptive field grows).
This might only take 5-10 layers for a 608x608 image with a 3x3 filter per layer, but what happens on a 100 megapixel image? It would surely take many more layers at a 3x3 filter size to encode larger and larger features, because on a 100 megapixel image a 3x3 patch can't represent any kind of larger feature.
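To make that concrete, here is a rough back-of-envelope sketch of how the receptive field of a stack of 3x3 convolutions (each followed by a stride-2 downsampling step) grows, compared against the side length of a 608x608 image versus a ~100 megapixel (10000x10000) one. The layer configuration is purely hypothetical, just for illustration, not taken from any particular network.

```python
# Back-of-envelope receptive-field growth for a stack of conv/pool layers.
# The layer spec below is an assumption for illustration only.

def receptive_fields(layers):
    """Yield (layer_index, receptive_field, jump) for each (kernel, stride) layer.

    receptive_field: side length (in input pixels) seen by one output unit.
    jump: distance (in input pixels) between adjacent output units.
    """
    rf, jump = 1, 1
    for i, (k, s) in enumerate(layers, start=1):
        rf += (k - 1) * jump   # each layer widens the field by (k-1) * current jump
        jump *= s              # striding spreads output units further apart
        yield i, rf, jump

# Hypothetical stack: ten blocks of (3x3 conv, stride 1) + (2x2 pool, stride 2)
layers = [(3, 1), (2, 2)] * 10

for i, rf, jump in receptive_fields(layers):
    print(f"layer {i:2d}: receptive field {rf:5d} px "
          f"({rf / 608:6.1%} of 608, {rf / 10000:6.2%} of 10000)")
```

With these made-up numbers, the receptive field exceeds the full 608 pixels after roughly 15 layers, but after all 20 layers it still only covers around 3,000 of 10,000 pixels, which is the gap I'm asking about.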
Do we generally choose bigger filter sizes? Make the network deeper so that there are more convolutional reductions? Both? Just experiment? Any guidance on what is normally done?
Apologies if this doesn’t make sense. I don’t think it’s quite a fully formed thought just yet.