Suppose, for simplicity, that we only have to detect one type of object.
For the sliding window algorithm, we use a model trained on pairs (x1, y1), where y1 indicates the object's presence in a cropped image x1. When we then apply this model across a full image, the output is not 1 x 1 x u but n x n x u, where n x n is the number of cells and u is the number of output units for a single cropped image.
In YOLO, by contrast, we train the model directly on full images, and the output is n x n x u, where n x n is the number of grid cells; this output volume is essentially the outputs of all the grid cells stacked together.
So the target we use to train the YOLO model is labeled by us, such that each grid cell has corresponding values for the presence of an object and for b_x, b_y, b_h, b_w. We assign the object to the cell that contains its center; other cells may contain parts of the object, but we still label them as "no object present".
Am I right about all this?
Yes. In YOLO, each detected object is reported as belonging to the cell that contains its centroid. Nothing prevents an object from spanning multiple cells, but it is reported only once, by the cell containing the centroid. This means a cell can report no objects even though it contains parts of several objects whose centers lie in other cells.
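To make the labeling rule concrete, here is a minimal sketch of building the n x n x u target volume for the single-class case. `make_yolo_labels` is a hypothetical helper, not part of any YOLO implementation; it assumes box coordinates are normalized to [0, 1] and each cell's vector is [p, b_x, b_y, b_h, b_w].

```python
import numpy as np

def make_yolo_labels(objects, grid=3, units=5):
    """Build an n x n x u label volume (hypothetical helper, single class).

    objects: list of (cx, cy, h, w) boxes, coordinates normalized to [0, 1].
    Each object is assigned to the cell containing its center; all other
    cells keep p = 0 ("no object present"), even if the box overlaps them.
    """
    y = np.zeros((grid, grid, units))
    for cx, cy, h, w in objects:
        col = min(int(cx * grid), grid - 1)  # cell column holding the center
        row = min(int(cy * grid), grid - 1)  # cell row holding the center
        y[row, col] = [1.0,                  # object present in this cell
                       cx * grid - col,      # b_x: center x relative to cell
                       cy * grid - row,      # b_y: center y relative to cell
                       h, w]                 # b_h, b_w: image-relative size
    return y

labels = make_yolo_labels([(0.7, 0.4, 0.5, 0.3)])
print(labels[1, 2])  # the cell holding the center gets the full label
print(labels[0, 0])  # every other cell is labeled "no object present"
```

Note that a 3 x 3 grid with 5 units per cell gives exactly the 3 x 3 x 5 output shape described above, and only one of the nine cells carries p = 1 even though the box (h = 0.5, w = 0.3) spans several cells.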