Object detection using YOLO

While doing object detection using the YOLO algorithm, why do we get more than one bounding box for each object before applying non-max suppression?

YOLO v2, which is the basis for the autonomous vehicle programming exercise, makes S*S*B predictions each forward pass, where S*S is the grid cell count and B is the number of anchor boxes. Each of those S*S*B locations (the original YOLO paper refers to them as detectors) makes its own prediction about whether an object is present. If an object is near a grid cell boundary, or significantly overlaps two grid cells, it is entirely possible that two neighboring detectors will each think the object center is in their location. Non-max suppression can then remove the duplicates by assuming that if two predicted bounding boxes mostly overlap (highly similar location and shape), they must be the same object.
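
To make that concrete, here is a minimal sketch of plain greedy non-max suppression. The box format (x1, y1, x2, y2) and the 0.5 IoU threshold are illustrative assumptions, not the exercise's exact implementation:

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); intersection-over-union measures overlap.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    # Keep the highest-confidence box, drop every box that overlaps it
    # by more than iou_threshold, then repeat on the survivors.
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_threshold])
    return keep
```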

Notice that if the localization prediction were always 100% accurate this would never happen, because each object is only in one actual location in the image (at least assuming a non-quantum solution space for these discussions!). This situation arises when there is a lack of precision in the localization output.

My understanding:
An object can cover multiple grid cells, and each covered grid cell can claim that it has the object by giving a bounding box as output, resulting in multiple boxes. Of these, we select the one claimed with the most confidence.

Please suggest any corrections.

After further consideration, I believe it can also happen within one grid cell because of anchor boxes. Especially if the detected object is not close in size to any one anchor box, I think two anchor boxes from the same grid cell can end up with bounding box predictions, only one of which should survive non-max suppression.

So the overall idea is that the multiple bounding boxes are a result of multiple claims, by either grid cells or anchor boxes, that they have the object. Is that right?

I'm confident that's what you were thinking, but let's be precise.

Each vector of predictions [p_c, b_x, b_y, b_w, b_h, c_1, …, c_n] sits at a specific location in the network output [S_x, S_y, B_i, …]. Duplicates most likely come from neighboring grid cells (any anchor box), though if the anchor boxes and training are good, it is likely the same anchor box in each grid cell. They can also come from the same grid cell with different anchor boxes; I think this happens especially when two anchor boxes are close in shape. Hopefully, if the anchor boxes are quite distinct in shape, training would reduce the occurrence of ‘false positives’, which is really what these multiple predictions on the same object are.
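
To illustrate that layout, here is a small sketch of indexing into a YOLO v2-style output tensor. The grid size, anchor count, and class count below are assumptions for illustration, not the exercise's values:

```python
import numpy as np

S, B, n_classes = 19, 5, 80                      # assumed grid size, anchors, classes
output = np.random.rand(S, S, B, 5 + n_classes)  # stand-in for a real network output

# The prediction vector [p_c, b_x, b_y, b_w, b_h, c_1, ..., c_n]
# for grid cell (row, col) and anchor box index b:
row, col, b = 7, 11, 2
p_c = output[row, col, b, 0]            # objectness: is an object present here?
b_x, b_y = output[row, col, b, 1:3]     # predicted box center
b_w, b_h = output[row, col, b, 3:5]     # predicted box width and height
class_scores = output[row, col, b, 5:]  # c_1 ... c_n class scores
```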

Sounds to me like you have this :+1::bulb:

Yesss…

I don’t think they can be called ‘false positives’ because they really do have the object, but they just don’t contribute to the labels that interest us.

Almost… :sweat_smile:
Now I'm trying to understand how an object spanning multiple grid cells is detected when the prediction is made by a single grid cell.

I say that because we're considering the case where there is only one object, but two grid cells 'claim' it, to use your word. The object center is only in one of those locations, so any others are mistakes.

The key to understanding how a bounding box prediction can be larger than one grid cell is this diagram from the paper…

[Figure: bounding box prediction diagram from the YOLO v2 paper]

Bounding box width and height b_w and b_h are multiples of the anchor box width and height p_w and p_h. Here p is used for the anchor box because the paper refers to them as priors. The anchor box shape is multiplied by e^{t}, where t is the direct output of the network. e^{t} can be any positive number; if t > 0 then e^{t} > 1, so b will be larger than p, and can even be larger than the grid cell size.
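
To make the arithmetic concrete, here is a tiny sketch; the prior width and network output below are made-up numbers, not values from the paper or the exercise:

```python
import math

p_w = 1.0                  # assumed prior (anchor) width, in grid-cell units
t_w = 0.9                  # assumed raw network output for width, t_w > 0

b_w = p_w * math.exp(t_w)  # b_w = p_w * e^{t_w}
print(b_w)                 # ~2.46 grid cells: larger than the prior,
                           # and larger than one grid cell
```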

Here’s another recent thread that should resonate…