I have seen Andrew’s lectures many times over and over and still struggling to understand all the details about YOLO mostly regarding what happens when an object spans more than one grid cell. The only answer I find repeatedly given this question is the canned lines “only one grid cell is responsible for predicting the object” or “object is assigned to only one grid cell” without giving any detail as to how this is actually accomplished. Bit question is… when number of grid cells is more than 3x3, say 19x19, it is quite clear that more than one grid cell will predict the objects with high probability. But during training of YOLO we are forcing it to stop predicting the object in cells to it was not “assigned”. In other words, aren’t we training it to ignore major parts of object that fall in non-assigned grid cells, and making it just use the part that falls in “assigned” grid cell? Wouldn’t that confuse the network and ultimately degrade its prediction ability???
The training method of assigning an object to a single, center-based grid cell is not about forcing the network to ignore pixels. Instead, it is a highly effective training regularization that forces the network to:
-
Unambiguously identify the location of the object’s center.
-
Use its broad receptive field to aggregate information about the entire object (even the parts in neighbor cells).
-
Produce only one high-confidence prediction per object, greatly simplifying the output and improving speed.
While non-assigned cells are trained to be silent (Objectness=0), their visual input still contributes indirectly to the final accurate prediction made by the responsible cell.
Also see
And
Or even
So you are saying that :
- Objectness is zero for all neighboring gridcells when centroid does not lie in them EVEN when they contain large parts of the object?
- Objectness is 1 for gridcell in which centroid lies and it considers/uses large parts of the object from neighboring gridcells due to convolutional application of sliding window as discussed in this slide?
Yes, and Yes.
@aiai1 I agree with these assertions from @TMosh . However there are nuances.
First, the assignment of 1 and 0 as the objectness value happens at labeled training data creation time; population of Y. The output of forward propagation, \hat{Y} , contains predicted values, not assigned values.
Second, YOLO works because the input to forward propagation is the entire input image, X, not a sub-region, which is the case in sliding windows. The grid cells are not decompositions of that input. They influence the shape of the network output, not its input. Labeled ground truth bounding box shapes in Y are not constrained to lie within a single grid cell, and neither are the predicted bounding boxes in the network output \hat{Y}
Finally, despite best efforts, training and runtime prediction are imperfect. It is possible that more than one grid cell of \hat{Y} will end up with a non-zero objectness predicted value (and bounding box shape and location and class) for the same object in the input image. During training the cost function attempts to correct this. At runtime, we use non-max suppression to disambiguate.
Yet another example of how, actually, this is accomplished. YOLO is a step up in complexity from what has been covered previously in the lectures, hopefully these historical threads help clear things up a little.
