DLS - Course 4 - W3 - bounding box coordinates

Seeking more clarification, and my sincere apologies if I missed any crucial information in the lecture on Object Detection algorithms in DLS Course 4 - W3.

Regarding the bounding box coordinates (b_x, b_y, b_h, and b_w): how does YOLO actually derive those coordinates? I understand how Non-Max Suppression is then applied to pick the right box, but how does YOLO derive the values in the first place?

Thank you

It is first trained on images whose objects are labelled with these coordinates.

Once training is satisfactory, it can then make estimates for similar images, i.e., output these numbers itself based on what it has learned.

In the training data, we provide all the values: b_x and b_y as the center of the object, and b_h and b_w as its height and width. After training on that data, the model then tries to predict the height and width for a grid cell where it considers the center of an object might exist.

Updated: In my message above, I mean that in grid cells (where the model thinks an object's center might be), the model tries to predict b_h and b_w for a bounding box.
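
To make that concrete, here is a minimal sketch (hypothetical numbers, not from the assignment) of how the label for one grid cell is encoded, following the lecture's convention with 3 classes:

```python
import numpy as np

# Minimal sketch of the label for ONE grid cell, following the
# lecture's convention with 3 classes (all numbers hypothetical):
#   y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]
# p_c      : 1 if an object's center falls inside this cell, else 0
# b_x, b_y : object center as fractions of the cell (between 0 and 1)
# b_h, b_w : box height/width relative to the cell (may exceed 1)
# c_1..c_3 : one-hot class indicator
y_cell = np.array([1.0,        # p_c: this cell contains an object center
                   0.4, 0.3,   # b_x, b_y: center at 40% across, 30% down
                   2.0, 1.5,   # b_h, b_w: box is 2x cell height, 1.5x cell width
                   0, 1, 0])   # one-hot class label (class 2)

# Cells whose p_c = 0 contain no object center; their remaining
# entries are "don't care" and are ignored by the loss.
```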

Best,
Saif.

As Gent and Saif have said, the YOLO algorithm learns that through training on labelled data that includes the bounding boxes. If you want to dig deeper into how the algorithm works, there are a number of detailed threads about YOLO on the forums; e.g., this one would be a good place to start, and it links to some others.


OK, thanks for your response. Much appreciated.

Thank you, Saif. Much appreciated.

Thank you, Paulin. I will go through the links.

Not quite. Grid cells in YOLO are a fixed size, determined before training starts, and their shape is not part of the network output. The predicted bounding box shape, b_w and b_h, can be smaller than, equal to, or larger than the grid cell in which the object's center is located.
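
A quick numeric illustration of that point (made-up numbers; a 608x608 input with a 19x19 grid, as an example):

```python
# Made-up numbers to show the grid is fixed while boxes are not:
image_size = 608              # assumed input resolution
grid = 19                     # assumed 19x19 grid of cells
cell_px = image_size / grid   # each cell covers 32x32 pixels

# A box predicted in one cell, with width/height in cell units:
b_w, b_h = 4.7, 3.1           # both > 1, i.e. larger than the cell
print(b_w * cell_px, b_h * cell_px)   # ~150 x ~99 pixels, vs. a 32-pixel cell
```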

@Mithun_Kar
If you read the original papers carefully, or some of the several YOLO threads discoverable through the one linked by @paulinpaloalto above, you'll see that YOLO doesn't directly predict any of b_x, b_y, b_w, or b_h. Rather, the direct floating-point values it outputs are subjected to a further transformation that generates the location and shape coordinates. The inverse transformation must be applied when establishing the training data. Other than that, they are produced exactly the same way any neural network produces any floating-point output: labels provide ground-truth values Y, the network generates predicted outputs \hat{Y}, and training minimizes a loss function of the difference Y - \hat{Y}.
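
For concreteness, here is a minimal sketch of that transformation as given in the YOLOv2 paper (the names t_x, t_y, t_w, t_h for the raw outputs, c_x, c_y for the cell offsets, and p_w, p_h for the anchor priors are the paper's, not the course notebook's):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def raw_to_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """YOLOv2-style decoding of raw network outputs into box coordinates.
    c_x, c_y: offsets of the grid cell's top-left corner;
    p_w, p_h: width/height priors of the anchor box."""
    b_x = sigmoid(t_x) + c_x   # sigmoid keeps the center inside this cell
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * np.exp(t_w)    # exp scales the anchor prior (always positive)
    b_h = p_h * np.exp(t_h)
    return b_x, b_y, b_w, b_h

# When building training labels, the inverse is applied to the
# ground-truth boxes, e.g. t_w = log(b_w / p_w).
```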

The expressions relating the b_{…} coordinates to the direct network outputs are discussed here:

Hope this helps

Thanks, I will go through it. Very much appreciated.

Yes, you are right. In my message above, I meant that in grid cells (where the model thinks an object's center might be), the model tries to predict b_h and b_w for a bounding box. Pardon my vague wording.

Best,
Saif.
