I'm looking for some clarification, and I apologize if I missed something crucial in the lecture on object detection algorithms in DLS Course 4, Week 3.
When finding the bounding box coordinates (b_x, b_y, b_h, and b_w), how does YOLO actually derive them? I understand how non-max suppression is then applied to pick the right box, but how does YOLO derive the values in the first place?
In the input data, we provide all the values: bx and by as the center of the object, and bh and bw as its height and width. So, after training on that data, the model tries to find the height and width for a grid where it thinks the center of the object might be.
Update: in my message above, I meant that in the grid cells where the model thinks an object's center might be, the model tries to predict bh and bw for a bounding box.
As Gent and Saif have said, the YOLO algorithm learns that through training based on labelled data that includes the bounding boxes. If you want to dig deeper into how the algorithm works, there are a number of detailed threads about YOLO on the forums, e.g. this one would be a good place to start and it links to some others.
Not quite. Grids in YOLO are fixed size, determined before training starts, and their shape is not part of the network output. The predicted bounding box shape, b_w, b_h, can be smaller than, equal to, or larger than the grid cell in which it is located.
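To make the point about fixed grid cells concrete, here is a minimal sketch of how a labelled box could be assigned to a cell. It assumes box coordinates given as fractions of the image and an illustrative 3x3 grid; the function name `assign_to_cell` and the `grid_size` value are hypothetical, not taken from the course or the papers.

```python
def assign_to_cell(cx, cy, w, h, grid_size=3):
    """Map a box (center cx, cy and size w, h, all as fractions of the
    image in [0, 1]) onto a fixed grid_size x grid_size grid.

    The grid is chosen before training; only the box varies.
    """
    # The cell that "owns" the box is the one containing its center.
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    # Box size expressed in units of one grid cell.
    w_cells = w * grid_size
    h_cells = h * grid_size
    return row, col, w_cells, h_cells


# A box centered in the image, 40% of the image wide and 20% tall.
row, col, w_cells, h_cells = assign_to_cell(0.5, 0.5, 0.4, 0.2)
print(row, col, w_cells, h_cells)
```

In this example the box lands in the middle cell, is wider than one cell, and is shorter than one cell, so the box shape is independent of the cell shape.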
@Mithun_Kar
If you read the original papers carefully, or some of the several YOLO threads discoverable through the one linked by @paulinpaloalto above, you’ll see that YOLO doesn’t directly predict any of b_x, b_y, b_w, or b_h. Rather, the direct floating point values it outputs are subjected to further transformation to generate the location and shape coordinates. The inverse transformation must be performed when establishing the training data. Other than that, they are produced exactly the same way any neural network produces any floating point output. By that I mean labels provide ground truth values Y, the network generates predicted outputs \hat{Y}, and training minimizes a loss function that penalizes the difference between Y and \hat{Y}.
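As a rough illustration of that transformation and its inverse, here is a minimal sketch assuming the YOLOv2-style parameterization: a sigmoid on the center offsets within the cell and an exponential scaling of an anchor box. The function names, the choice of grid-cell units, and the anchor sizes are assumptions made for this example; other YOLO versions and implementations differ in the details, including which quantity the loss is computed on.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, col, row, anchor_w, anchor_h):
    """Raw network outputs -> box center and size, in grid-cell units."""
    b_x = col + sigmoid(t_x)        # center stays inside cell (col, row)
    b_y = row + sigmoid(t_y)
    b_w = anchor_w * math.exp(t_w)  # size is unconstrained, can exceed one cell
    b_h = anchor_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


def encode_box(b_x, b_y, b_w, b_h, col, row, anchor_w, anchor_h):
    """Inverse of decode_box, as could be used when preparing labels.

    Assumes the center offset within the cell is strictly between 0 and 1.
    """
    off_x, off_y = b_x - col, b_y - row
    t_x = math.log(off_x / (1.0 - off_x))  # logit, the inverse of sigmoid
    t_y = math.log(off_y / (1.0 - off_y))
    t_w = math.log(b_w / anchor_w)
    t_h = math.log(b_h / anchor_h)
    return t_x, t_y, t_w, t_h


# Round trip with hypothetical values: cell (col=1, row=1), anchor 2.0 x 0.5 cells.
truth = (1.5, 1.25, 1.2, 0.6)
raw = encode_box(*truth, col=1, row=1, anchor_w=2.0, anchor_h=0.5)
box = decode_box(*raw, col=1, row=1, anchor_w=2.0, anchor_h=0.5)
print(raw)
print(box)
```

Encoding a ground truth box and then decoding it returns the original box up to floating point error, which is the sense in which the two transformations are inverses.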
The expressions relating the b_{…} coordinates with the direct network outputs are discussed here:
Yes, you are right. In my message above, I meant that in the grid cells where the model thinks an object's center might be, the model tries to predict bh and bw for a bounding box. Pardon my vague wording.