Why use grids at all in the YOLO Algorithm? Why can’t we just treat the input image itself as 1 big cell and have (bx, by, bh, bw) for multiple cars/objects based on that one cell. It would presumably be the same as having multiple cars/objects in one of say 19x19 grid cells and we can handle it the same way when the input image has multiple objects.
I think this is because the computation on a large image is much more expensive thats why you separate it in regions that you can easily manage!
Hmmm. But YOLO doesn’t actually divide the input image into smaller sub-images…that’s sliding windows, right? YOLO inputs the entire image into the network at once.
I might think of it as YOLO doesn’t divide the input into multiple small regions, but it does divide output into multiple small regions in order to support a fixed maximum number of object predictions per image (SSB for v2)
We split the images into grids so we can have our output (y) be more structured with some fixed number of predictions. Is that right?
But then, if we treat the entire image as 1 cell and say we have 3 total possible classes - why can’t we just have, a fixed number of predictions (say 30 → 10 predictions per class). This would also have a fixed maximum number of object predictions per image.
Any clarification would be highly appreciated.
My intuition is the single (or even merely fewer, say 3 or 5) cell divisions would adversely impact localization training and prediction accuracy. But I haven’t thought this all the way through and at the end of the day it’s an empirical decision. Do some experiments, see what works best in training and in the operational environment taking into account costs of computation, memory, throughput, etc. These are all just engineering tradeoffs one can choose among, right? Go where your data takes you.
I see. If I understood correctly you’re saying neither design is right / wrong - it’s just a matter of what works better in practice. So there could be a hypothetical scenario where using just 1 large cell might make sense. Is this a reasonable assumption?