Inference and crop size


I tried looking for information on this online, but couldn’t find anything.
How does a computer vision model inference work when the crop size is smaller than the image size?
For example, let’s say that I have a model with a crop size of (640,640), but the data I use to train, and eventually run inference on, is full HD images (1920x1080).
I couldn’t find a strides setting for inference. Does this mean that the model will run 6 times to infer the image?
I’m trying to understand if increasing the crop size to (540,960) would actually speed up my inference process since it will only have to run 4 times instead of 6.
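
To make the arithmetic concrete, here is a rough sketch of the tile count, assuming inference is done by sliding a crop-sized window over the image (that is an assumption; some frameworks resize the image or run a fully convolutional model on the whole frame instead). The helper name `tiles_needed` and the (height, width) ordering are just for illustration.

```python
import math

def tiles_needed(image_hw, crop_hw, stride_hw=None):
    """Count the crops a sliding window needs to cover the whole image.

    Assumes (height, width) ordering and that the last window in each
    direction is shifted or padded so the image border is still covered.
    With no stride given, windows do not overlap (stride == crop size).
    """
    img_h, img_w = image_hw
    crop_h, crop_w = crop_hw
    stride_h, stride_w = stride_hw or crop_hw
    rows = max(1, math.ceil((img_h - crop_h) / stride_h) + 1)
    cols = max(1, math.ceil((img_w - crop_w) / stride_w) + 1)
    return rows * cols

full_hd = (1080, 1920)
for crop in [(640, 640), (540, 960)]:
    n = tiles_needed(full_hd, crop)
    pixels = n * crop[0] * crop[1]
    print(crop, "->", n, "runs,", pixels, "pixels processed")
```

Under these assumptions the (640,640) crop needs 6 runs and the (540,960) crop needs 4. Fewer runs is not automatically faster, though, since each run is larger. What differs here is the wasted work: 6 x 640 x 640 = 2,457,600 pixels versus 4 x 540 x 960 = 2,073,600, which is exactly one full HD frame, so the square crop processes roughly 18% more pixels.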

I’d appreciate your thoughts.

Hi fabiansc,

I just happened to run into your post. In case you are still wondering, maybe this post helps?

Hi @reinoudbosch !
Thanks for your response! I’m not exactly sure it does. I’m still trying to figure out the smartest way to choose the right crop size (the input layer of an FCN, if you will) for a fixed image size.
At some point I’ll do some ‘academic’ research and test how the crop size dimensions affect the model’s running speed.
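
A minimal sketch of how such a test could be timed, assuming the per-image inference for one crop size is wrapped in a zero-argument callable. `time_inference`, `run_once` and `fake_inference` are hypothetical names, and the dummy workload only stands in for a real model.

```python
import statistics
import time

def time_inference(run_once, n_warmup=3, n_runs=10):
    """Return the median wall-clock time (seconds) of one call to run_once.

    run_once is a hypothetical zero-argument callable that runs the model
    over one full image with a given crop size. On a GPU it should block
    until the work is done (e.g. end with torch.cuda.synchronize()),
    otherwise only the kernel launch time is measured.
    """
    for _ in range(n_warmup):      # warm-up runs are not timed
        run_once()
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def fake_inference(n_tiles, tile_pixels):
    """Stand-in workload; replace with the real model call."""
    def run_once():
        for _ in range(n_tiles):
            sum(range(tile_pixels // 1000))
    return run_once

for name, tiles, pixels in [("640x640", 6, 640 * 640), ("540x960", 4, 540 * 960)]:
    seconds = time_inference(fake_inference(tiles, pixels))
    print(name, f"{seconds * 1e3:.3f} ms per image")
```

The numbers from the dummy workload mean nothing by themselves; the point is only the structure: warm up, time several full-image passes per crop size, and compare medians.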

Ah, if I now understand correctly, you don’t want to decrease the resolution of the original image to fit the model. Of course, an alternative would be to increase the size of the input layer. I guess it depends on the environment you want to train and run the model in. In that case, experimenting may be the way to go.

As an added point, it may also be interesting to look into vision transformers.

Hi @reinoudbosch,

What I’m trying to do is to get the best bang for buck :slight_smile:
I’m working on something that has to be almost “pixel accurate” which means I need a heavy model.
Now given that, I’m trying to optimize the crop size so that I get a model that is as accurate as possible while not requiring 20x A100 GPUs to run inference in real time :slight_smile: If I can cut it down by 30-50% just by playing around with the batch size, that would be amazing.

Ok, I think I now get the issue. I am not an expert on this, but maybe this article can give you some ideas? It points towards lightweight scanner networks and vision transformers. Unfortunately, I guess this is the best I can suggest for now.