I understand that IOU computes an area ratio between the predicted bounding box (drawn in purple in the lecture video) and the ideal one (red). And we know for a fact that in order to calculate any of these rectangular areas, we need the coordinates of each bounding box.

My question: if we already know where the ideal bounding box should be (meaning we already have its coordinates, and therefore its area to compare against the predicted box), why would we bother letting the algorithm draw an unwanted box in the first place?

If we already know the exact location of our target object in the image, why play the "accuracy" game at all?

Link to the course material: https://www.coursera.org/learn/convolutional-neural-networks/lecture/p9gxz/intersection-over-union.

The point is that we don't "let" the algorithm draw unwanted boxes, and we don't know the ideal bounding boxes *a priori* in "prediction" mode. It's only in training that we have the labelled data, right? We just train the algorithm and it does what it does. If we don't like the results, then we need to retrain it with either more and better data or better hyperparameter choices or both. I have not studied the YOLO paper(s), so I don't know the full details of how they arrived at this algorithm, but they apparently realized that it frequently detects the same object multiple times with slightly different bounding boxes. Adding this "non-max suppression" step by using IoU is just a computationally inexpensive way to refine the outputs and get better results.

There are some excellent threads on the forum from the past few years that go deeper into various aspects of YOLO than what we see in the lectures and the assignments. Here's one that discusses non-max suppression. Please have a look and I hope that will shed more light on this question.


The linked video discusses IOU in the context of evaluating your object detection performance, of which there are two components: first, did we get the classification correct, and second, how good is the localization. IOU is being used here to evaluate the performance of the localization. We compare the correct, known, ground truth bounding box, with the predicted bounding box. A perfect prediction would have an IOU of 1.0 while a perfectly awful prediction would have an IOU of 0.0. An average IOU is interesting to know for both the training and validation data sets since it is a measure of localization error that doesn't require instrumentation of the cost function itself.
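
To make the comparison concrete, here is a minimal sketch of the IOU computation and of an average IOU over a labelled set. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates; that format and the function names `iou` and `mean_iou` are my own choices for illustration, not something taken from the course code.

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes, each given as (x1, y1, x2, y2).

    The corner format is an assumption made for this sketch.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; clamp to 0 when the boxes do not touch.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Guard against degenerate zero-area boxes.
    return intersection / union if union > 0 else 0.0


def mean_iou(predicted_boxes, ground_truth_boxes):
    """Average IOU over matched (prediction, ground truth) pairs."""
    pairs = list(zip(predicted_boxes, ground_truth_boxes))
    return sum(iou(p, g) for p, g in pairs) / len(pairs)
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` gives 1/7: the boxes overlap in a 1x1 square, and together they cover 4 + 4 - 1 = 7 units of area. Note that `mean_iou` needs the ground truth boxes, which is why this metric is only available on labelled data.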

As @paulinpaloalto states above, you only have this information available at training time because at runtime, you won't have ground truth.

Note that IOU does have a role to play in some runtime algorithms, but then it is used to compare two predicted bounding boxes to try to determine if they contain the same object. Two predicted bounding boxes with an IOU of 1.0 contain the exact same pixels, and thus are highly likely to be of the same object; in practice, any IOU above some threshold is treated the same way. Since there is only one object, you can think of one of the predictions as a true positive and one as a false positive. Two predicted bounding boxes with an IOU below that threshold can be assumed to be of different objects. I think this is the subject of the following video (NMS).
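
The runtime use described above can be sketched as a simple greedy non-max suppression. This is only an illustration under my own assumptions: boxes are `(x1, y1, x2, y2)` tuples, each prediction has a confidence score, and the 0.5 threshold is an arbitrary default. It is not the YOLO implementation or the course assignment code. Notice that no ground truth appears anywhere; only predictions are compared with each other.

```python
def iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2) (assumed format)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first.

    A box is dropped when it overlaps an already kept, higher-scoring box
    with IOU at or above the threshold, on the assumption that both
    predictions refer to the same object.
    """
    # Visit predictions from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) < iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one object plus one distant detection.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the first two boxes have an IOU of 81/119 (about 0.68), so the lower-scoring one is suppressed and `kept` is `[0, 2]`. The third box does not overlap either of them and is treated as a different object.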

Reasonably accurate localization thus is important in optimizing the truth table of object detection algorithms (high true positive, low false positive) which is why one would use IOU during training.

Hope this helps
