Target Label Y with grid cells

I think this is a correct summary. It’s an engineering tradeoff. More grid cells means the capability to detect more objects per image. But since the entire prediction vector is computed for each grid cell and each anchor box, the size of the output, and with it computation and memory, scales with the square of the grid side length (an S x S grid has S^2 cells). That means there is a practical, cost-driven upper bound on how fine the grid can be.
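To make the scaling concrete, here is a rough sketch of the arithmetic. It assumes each anchor box predicts 5 values (objectness plus 4 box coordinates) plus one score per class; the function name and the example numbers (5 anchors, 80 classes) are just illustrative choices, not something fixed by the discussion above.

```python
def yolo_output_size(grid_size, num_anchors, num_classes):
    """Number of values in an S x S x B x (5 + C) output tensor."""
    values_per_box = 5 + num_classes  # p_c, 4 box coordinates, class scores
    return grid_size * grid_size * num_anchors * values_per_box


# Doubling the grid side length quadruples the output size.
for s in (19, 38, 76):
    print(s, yolo_output_size(s, num_anchors=5, num_classes=80))
```

The output tensor is only one part of the cost, but it shows why the grid cannot be made arbitrarily fine.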

The ground truth labels are initially associated with the training input, X. Often they are provided in a text file, for example as XML or JSON. For YOLO training, these labels must be mapped in a preprocessing step to a matrix sharing the network output shape - what you refer to as target label Y above. During training, the ground truth labels are compared at each iteration with the network-generated predicted output, \hat{Y}.
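As a minimal sketch of that preprocessing step, the function below maps a list of boxes into a grid-shaped Y. Several things here are assumptions for illustration only: the name `encode_labels`, boxes given as (class_id, x_center, y_center, width, height) with coordinates normalized to [0, 1], a single anchor box per cell, and the per-cell layout [p_c, b_x, b_y, b_h, b_w, class one-hot]. Real pipelines also pick the best-matching anchor per box, which is omitted here.

```python
import numpy as np


def encode_labels(boxes, grid_size=3, num_classes=3):
    """Map ground truth boxes to a (S, S, 5 + C) target array.

    Assumed input: (class_id, x_center, y_center, width, height),
    all coordinates normalized to [0, 1] relative to the image.
    """
    y = np.zeros((grid_size, grid_size, 5 + num_classes), dtype=np.float32)
    for class_id, xc, yc, w, h in boxes:
        # The cell containing the box center is responsible for the object.
        col = min(int(xc * grid_size), grid_size - 1)
        row = min(int(yc * grid_size), grid_size - 1)
        y[row, col, 0] = 1.0                    # p_c: an object is present
        y[row, col, 1] = xc * grid_size - col   # b_x: offset within the cell
        y[row, col, 2] = yc * grid_size - row   # b_y: offset within the cell
        y[row, col, 3] = h                      # b_h: relative to the image
        y[row, col, 4] = w                      # b_w: relative to the image
        y[row, col, 5 + class_id] = 1.0         # one-hot class label
    return y


# One object of class 1, centered in the image, on a 3 x 3 grid.
Y = encode_labels([(1, 0.5, 0.5, 0.2, 0.4)])
print(Y.shape)
print(Y[1, 1])
```

With one anchor per cell, two objects whose centers fall in the same cell would overwrite each other, which is one reason anchor boxes and finer grids are used.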

There is some related discussion here: Week 3: finding the correct cell in YOLO
