Filter size and # of layers considerations for larger image sizes?

I’ve noticed that the images these ConvNets are trained on are rather small, e.g. YOLO uses 608x608.

I understand that you can downscale images to these sizes, but surely in the age of 100-megapixel phone cameras we will sometimes want to run a CNN on a much larger image.

If I am understanding ConvNets correctly, a first layer using a 3x3 or 5x5 filter is only looking at 9 or 25 pixels respectively per sliding window. While that might be a meaningful fraction of a 608x608 image, it clearly isn’t for a 100-megapixel image.

Then, generally, as the height and width decrease through successive layers, the convolutions create encodings based on information from larger and larger parts of the image (the receptive field grows).

This might only take 5-10 layers for a 608x608 image with a 3x3 filter size per layer, but what happens on a 100-megapixel image? It would surely take many more layers at a 3x3 filter size to encode larger and larger features, because on a 100-megapixel image a 9-pixel window can’t represent any kind of larger feature.
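
To put a rough number on that intuition, here is a minimal receptive-field calculation in plain Python (the layer counts and kernel sizes are illustrative assumptions, not any real network’s config):

```python
# Minimal sketch: theoretical receptive field of a stack of conv layers.
# The layer counts and kernel sizes below are illustrative assumptions only.

def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) after a sequence
    of (kernel_size, stride) conv layers."""
    rf, jump = 1, 1          # jump = spacing between adjacent outputs, in input pixels
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Ten stride-1 3x3 convs only "see" a 21x21 patch of the input,
# no matter whether that input is 608x608 or 100 megapixels.
print(receptive_field([(3, 1)] * 10))   # -> 21
```

So without any downsampling, stacking 3x3 convs only grows the window by 2 pixels per layer, which is exactly why it feels hopeless for a 100-megapixel image.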

Do we generally choose bigger filter sizes? Or make the network deeper so that there are more convolutional reductions? Both? Experiment? Any guidance on what is normally done?
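
For what it’s worth, here is the same back-of-the-envelope arithmetic applied to the options in the question (go deeper vs. bigger filters vs. more downsampling). The specific layer configurations are made up purely to show the scaling behaviour:

```python
# Sketch comparing how fast the receptive field grows under the options
# asked about. The layer configs are made-up assumptions for illustration.

def receptive_field(layers):
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

deeper      = [(3, 1)] * 20                                 # option A: just go deeper
bigger      = [(7, 1)] * 20                                 # option B: bigger filters
downsampled = [(3, 1), (3, 1), (3, 2)] * 6 + [(3, 1)] * 2   # option C: 3x3 + periodic stride 2

print(receptive_field(deeper))       # -> 41   (grows linearly, +2 per layer)
print(receptive_field(bigger))       # -> 121  (still linear, just a bigger slope)
print(receptive_field(downsampled))  # -> 635  (grows roughly geometrically with each stride-2 stage)
```

Each stride-2 stage doubles how much every subsequent 3x3 filter covers, which is the usual reason networks keep small filters and rely on strided convs or pooling (plus depth) to reach large receptive fields, rather than simply enlarging the kernel; treat this as rough intuition, though, not a rule.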

Apologies if this doesn’t make sense. I don’t think it’s quite a fully formed thought just yet.

I think the key missing consideration here is: how large is each object? Our filter size cannot be unrelated to that. If objects take up around 40x40 pixels in that 100-megapixel image, then applying a 3x3 filter to detect features of those objects isn’t unreasonable.

Put another way, if our filter size could be larger than the objects themselves, then in a 640x640 image where objects are still around 40x40 pixels, why wouldn’t YOLO use filters larger than 40x40?
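
One way to make that concrete, as a rough sketch (the 40x40 object size and the stride values are illustrative assumptions): later 3x3 filters are applied to downsampled feature maps, so a nominally 3x3 filter already spans far more than 3x3 pixels of the original image.

```python
# Rough sketch: how many input pixels a 3x3 filter spans when applied to a
# feature map downsampled by `stride`. The 40x40 object size and the stride
# values are illustrative assumptions, not any particular network's numbers.

KERNEL = 3
OBJECT_SIZE = 40

for stride in (1, 8, 16, 32):
    span = KERNEL * stride   # lower bound on input pixels covered along one axis
    verdict = "covers" if span >= OBJECT_SIZE else "does not cover"
    print(f"3x3 filter on a /{stride} feature map spans >= {span} px: "
          f"{verdict} a {OBJECT_SIZE}x{OBJECT_SIZE} object")
```

So by the time the feature map has been downsampled 16x or 32x, a 3x3 filter already sees a patch bigger than the object, which is (I think) why detectors don’t need filters anywhere near 40x40.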

This is really just a question for discussion and consideration.