YOLO algorithm DLS COURSE 4

I am having trouble understanding YOLO. First, how does it calculate the midpoint of an object and automatically assign it to a grid cell?
Second, how can there be multiple bounding boxes for one object?

Hello,

For your first question:

For your second question:

I hope these posts help!

I think the more you read about YOLO, the more you gradually come to understand it. However, be very cautious with what you read on the internet, including the many posts I have written on this platform. Read it, compare it to your own understanding, READ THE CODE, try to implement it yourself. Repeat.

For example, the first post linked above says in part: "… The loss function … only considers a bounding box from thus identified grid cell but ignores the bounding boxes predicted by neighbouring[sic] grid cells, even if the object spills into them." This is not correct. The loss function considers every prediction from every grid cell and anchor box, including ones made erroneously. That includes the case where a detector predicts that the object is centered in the location it is responsible for when the object is actually centered in the location next to it.

Here are my own responses to your questions:
First, regarding the midpoint of an object, there are two contexts. At training time, YOLO must be provided with a label that contains the ground truth bounding box location and the type (class) of the object at that location. Depending on how the label was created, it might already contain the midpoint location and shape. Otherwise, if it is provided as rectangle corner coordinates, the center is easily determined by finding the width and height of the bounding box, dividing each by 2, and adding the results to the coordinates of the top-left corner. The grid cell that contains that center is the one the object is assigned to.
Then, at runtime, YOLO doesn’t ‘calculate’ the midpoint; it predicts it. That is, the neural net outputs a numeric value that is fed to the sigmoid activation function, which in turn produces a value between 0 and 1. That value is its prediction for the offset of the object’s center within the grid cell, measured from the cell’s top-left corner as a fraction of the cell size. Every grid cell (and every anchor box within it) makes this prediction every time the neural net runs a forward pass. The neural net is trained by computing the error between that prediction and the value provided in the ground truth label.
So the midpoint is calculated once from the ground truth label, and predicted once by the neural net.
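
To make the two contexts concrete, here is a minimal Python sketch. It is not the course's code or any library's API; the function names, the 608x608 image size, and the 19x19 grid are my own assumptions for illustration. It also assumes the YOLOv2-style convention in which the sigmoid output is the offset of the center from the top-left corner of the grid cell, in units of one cell.

```python
import math


def corners_to_center(x_min, y_min, x_max, y_max):
    """Convert corner coordinates to (center_x, center_y, width, height)."""
    w = x_max - x_min
    h = y_max - y_min
    return x_min + w / 2, y_min + h / 2, w, h


def assign_to_cell(cx, cy, image_w, image_h, grid_size):
    """Find the grid cell containing the center, plus the offset inside it.

    Returns (row, col, offset_x, offset_y); the offsets are in [0, 1) and are
    measured from the cell's top-left corner (an assumed convention).
    """
    gx = cx / image_w * grid_size  # center in grid units
    gy = cy / image_h * grid_size
    col = min(int(gx), grid_size - 1)  # clamp centers on the far edge
    row = min(int(gy), grid_size - 1)
    return row, col, gx - col, gy - row


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_center(t_x, t_y, row, col, image_w, image_h, grid_size):
    """Turn raw network outputs for one cell into a predicted center in pixels."""
    cx = (col + sigmoid(t_x)) / grid_size * image_w
    cy = (row + sigmoid(t_y)) / grid_size * image_h
    return cx, cy


# Training side: the label gives corners, we compute the center and its cell.
cx, cy, w, h = corners_to_center(100, 200, 300, 400)
row, col, off_x, off_y = assign_to_cell(cx, cy, 608, 608, 19)

# Runtime side: the network emits raw values; sigmoid keeps the center in the cell.
pred_cx, pred_cy = decode_center(0.0, 0.0, row, col, 608, 608, 19)
```

Note that the ground truth side is plain arithmetic, while the runtime side only squashes whatever the network emitted; nothing forces the two to agree except training.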

The second question, how can there be multiple bounding boxes for one object, also has two contexts. First, there is exactly one bounding box per object in the ground truth data. It is mapped to exactly one location in the y_train ground truth object. During prediction, however, every location (grid cell plus anchor box) makes predictions. There is nothing to prevent multiple locations from predicting the same object, particularly if the object is large and its center is near a grid cell boundary. YOLO takes all the values predicted by the network and submits them to further downstream processing, namely confidence thresholding and non-max suppression, to produce a final list of confident, unique predictions.
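
The downstream step can be sketched in a few lines of plain Python. This is a simplified illustration, not the course's implementation: the function names and the default thresholds (0.6 and 0.5) are assumptions, boxes are taken to be (x_min, y_min, x_max, y_max) tuples, and the suppression here ignores classes, whereas real pipelines commonly run it per class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_suppress(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Return indices of boxes that survive thresholding and non-max suppression."""
    # Step 1: drop low-confidence predictions.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    # Step 2: visit the rest from most to least confident.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Two overlapping predictions of one object, one separate object, one weak guess.
boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 300, 300), (0, 0, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.3]
survivors = filter_and_suppress(boxes, scores)
```

In this example the first two boxes overlap heavily, so only the more confident of the pair survives, and the last box is dropped by the score threshold before suppression even runs.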

I have some other threads on this forum that go into substantially more detail. This one might be of interest. Or I think you can look up my other YOLO posts via my username. Let us know if it helps.

P.S. I don’t get the point of just Googling something and posting the links. It doesn’t seem to add much value. To paraphrase Wittgenstein: ‘That whereof we cannot speak, thereof we must remain silent.’
