Output grid cells for YOLO, Sliding Window

Given an input image of some dimensions (in pixels), how many pixels should each grid cell of the output contain?

There is no magic number or rule for how big an image to use, nor for how many grid cells. And since pixels per grid cell depends on those two choices, there is no rule for that, either. The darknet YOLO code used in this class requires square images, downsamples 32x in the network, and uses a 19x19 cell grid. The original YOLO paper used 7x7. YOLO9000, aka v2, used 13x13. Hope this helps.
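To make the arithmetic concrete, here is a minimal sketch of how image size, downsampling factor, and grid size relate. The function names are made up for illustration, and the input sizes (608, 416, 448) are my assumptions: 608 is simply 19 x 32, and 416 and 448 are the sizes I recall from the YOLO v2 and original YOLO papers.

```python
def grid_side(image_side: int, downsample: int) -> int:
    """Number of grid cells along one side of a square image."""
    if image_side % downsample != 0:
        raise ValueError("image side must be a multiple of the downsampling factor")
    return image_side // downsample


def pixels_per_cell(image_side: int, cells_per_side: int) -> float:
    """Width (and height) in pixels of the image region covered by one grid cell."""
    return image_side / cells_per_side


# Assumed input sizes, see the note above.
print(grid_side(608, 32))       # class code: 32x downsampling
print(grid_side(416, 32))       # YOLO v2 style input
print(pixels_per_cell(448, 7))  # original YOLO style: 7x7 grid
```

So with a fixed 32x downsampling network, the pixels per cell is always 32 and the grid size just follows from whatever image size you feed in.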


The general answer is, “As many as are necessary to get good enough results”.

You have to experiment with it.


I agree with this statement. As always, there are tradeoffs. The larger your grid cells, the fewer total predictions you make each forward pass, so there are fewer computations and it runs faster. However, accuracy will be worse on objects much smaller than the grid cells. Conversely, the smaller your grid cells, the smaller the objects it can handle, but the more predictions per forward pass and the lower the throughput. Balance grid cell size against likely object size, operational frame rate, and accuracy requirements. Ain’t no free lunch.
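A rough way to see the cost side of that tradeoff is to count the outputs per forward pass for a few cell sizes. This is only a sketch: the function name is hypothetical, and the 5 anchor boxes per cell and 80 classes are assumed values, not something fixed by YOLO itself. It also counts only output values, not the actual compute of the network.

```python
def prediction_counts(image_side, cell_pixels, boxes_per_cell=5, num_classes=80):
    """Return (cells per side, total boxes, total output values) for a square image.

    boxes_per_cell and num_classes are assumed defaults for illustration.
    Each box carries 5 numbers (objectness, x, y, w, h) plus one score per class.
    """
    cells_per_side = image_side // cell_pixels
    total_boxes = cells_per_side * cells_per_side * boxes_per_cell
    total_values = total_boxes * (5 + num_classes)
    return cells_per_side, total_boxes, total_values


# Same 608x608 image, three different cell sizes in pixels.
for cell_pixels in (8, 16, 32):
    side, boxes, values = prediction_counts(608, cell_pixels)
    print(f"{cell_pixels:>2} px cells -> {side}x{side} grid, {boxes} boxes, {values} output values")
```

Halving the cell size quadruples the number of cells, and with it the number of boxes you have to score and filter on every frame.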