Convolution Confusion (YOLO/UNets)

Just wanted to reinforce that YOLO predicts the center location of an object in the image exactly the same way it predicts class, object presence, and bounding box shape/size: it is provided the correct center coordinates at training time and then learns to reproduce them as a prediction. What YOLO doesn’t do is reverse engineer the center location or the proper grid cell from the bounding box. That happens once, during training data creation, not inside YOLO itself, neither during training nor at runtime. At training data creation time, the object’s center coordinates are read from the labelled training data, the grid cell is computed by simple algebra from the image size and the chosen grid cell size, and the object’s training data is embedded at that location of the Y matrix (along with the best anchor box, which is computed at this time as well).

Then at runtime, and this is the real key to grokking the YOLO idea, every grid cell and anchor box tuple (i.e. detector) simultaneously makes predictions. The grid cell location never has to be computed or ‘assigned’ at the end of forward prop. Rather, it is explicit in the matrix location where the prediction vector occurs, both in the ground truth Y and in the corresponding \hat{Y}.
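Here is a minimal NumPy sketch of that "simple algebra" step. All the names and shapes are my own assumptions for illustration (a 13x13 grid, a 416x416 image, 3 anchor priors, 20 classes, and a hypothetical `embed_object` helper); the point is just that the grid cell and best anchor are computed while building Y, not by the network:

```python
import numpy as np

GRID = 13      # assumed 13x13 grid
IMG = 416      # assumed 416x416 input image
NUM_CLASSES = 20
# Assumed anchor priors, (w, h) as fractions of image size
ANCHORS = np.array([[0.28, 0.22], [0.38, 0.48], [0.90, 0.78]])
# One vector per (cell, anchor) detector: [objectness, x, y, w, h, class one-hot...]
Y = np.zeros((GRID, GRID, len(ANCHORS), 5 + NUM_CLASSES))

def iou_wh(wh_a, wh_b):
    """IoU of two boxes that share the same center -- compares shape/size only."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def embed_object(Y, cx, cy, w, h, cls):
    """cx, cy, w, h in pixels; cls is an integer class id."""
    # Simple algebra: which grid cell contains the object's center?
    col = int(cx / IMG * GRID)
    row = int(cy / IMG * GRID)
    # Best anchor = the prior whose shape best matches the labelled box
    box_wh = (w / IMG, h / IMG)
    a = int(np.argmax([iou_wh(box_wh, anc) for anc in ANCHORS]))
    # Center offsets within the cell, in [0, 1)
    tx = cx / IMG * GRID - col
    ty = cy / IMG * GRID - row
    Y[row, col, a, 0] = 1.0                  # object present at this detector
    Y[row, col, a, 1:5] = [tx, ty, *box_wh]  # center offsets + normalized size
    Y[row, col, a, 5 + cls] = 1.0            # class one-hot
    return Y

# e.g. a 150x120 px object of class 7, centered at pixel (200, 300)
Y = embed_object(Y, cx=200, cy=300, w=150, h=120, cls=7)
```

The network’s output \hat{Y} has exactly this same shape, which is why no assignment step is needed at runtime: the detector’s identity is its position in the tensor.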

If that doesn’t make sense, let’s discuss further.

Also note that this is why Non-Max Suppression (NMS) must be used with YOLO. Since every detector location makes its predictions at the same time, there can be what I think of as false positives: two or more predictions of the same object. The IoU similarity measure lets NMS discard those duplicates, lowering the count in that quadrant of the truth table. There can still be false positives due to other issues, or if the IoU threshold in NMS is set too high (so that overlapping duplicates escape suppression).
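For concreteness, here is a minimal sketch of class-agnostic NMS, assuming each prediction is a tuple `(score, x1, y1, x2, y2)` and that boxes whose IoU with a kept box exceeds the threshold are treated as duplicates (the function names are mine, not from any particular YOLO codebase):

```python
def iou(box_a, box_b):
    """Corner-format IoU of two boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(predictions, iou_threshold=0.5):
    """Keep the highest-scoring box; drop any box overlapping it above threshold."""
    remaining = sorted(predictions, key=lambda p: p[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [p for p in remaining
                     if iou(best[1:], p[1:]) < iou_threshold]
    return kept
```

With a higher `iou_threshold`, fewer boxes are suppressed, so duplicate detections of the same object are more likely to survive as false positives.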
