[Deep Learning Specialization W3A1 - YOLO] How does NMS know which grid cell our filtered/masked scores and bounding boxes correspond to?

How does the TensorFlow yolo_non_max_suppression (NMS) function know where the bounding boxes are if they’re specified on a per-grid-cell basis (i.e. relative to the boundaries of a cell), given that we haven’t indicated which grid cells were removed during our filtering process? How is the IOU calculated if we aren’t passing the exact coordinates of where the boxes are in the image, only where each box sits relative to its grid cell? We also don’t seem to be specifying which grid cell a given score and set of box coordinates belong to.

If we knew the grid cell for a given bounding box 4-tuple, I’d know how to transform it into absolute image coordinates accordingly, but without that information, and without knowing exactly which boxes have been filtered out by the low-score threshold, I’m unsure how we know where to draw the bounding boxes.

Would really appreciate some help on this!

We don’t remove grid cells, although it may happen that no objects are recognized in a given grid cell.

Take a look at all the data that we are dealing with here, the tensors all have shapes like:

(None, 19, 19, …)

The first dimension is “samples” and the second and third dimensions specify the grid cell, right? The full image is divided into 19 x 19 grid cells. So you have that information in addition to the bounding box coordinates, which are relative to the grid cell.
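As an illustrative sketch (all numbers here are made up; a 608 x 608 input with a 19 x 19 grid gives 32-pixel cells): the cell indices come from the position in the tensor itself, so converting a cell-relative box to image coordinates only needs those indices and the cell size.

```python
# Hypothetical example: one box predicted in grid cell (row=7, col=11) of a
# 19 x 19 grid over a 608 x 608 image, so each cell is 32 x 32 pixels.
grid_size = 19
image_size = 608
cell_size = image_size / grid_size  # 32.0

row, col = 7, 11      # the cell indices come from the tensor position itself
bx, by = 0.5, 0.25    # box center, relative to the cell (0..1 within the cell)
bw, bh = 2.0, 1.5     # width/height in units of cell size (can exceed 1)

# Absolute center: cell origin plus the relative offset, scaled to pixels
center_x = (col + bx) * cell_size   # (11 + 0.5) * 32 = 368.0
center_y = (row + by) * cell_size   # (7 + 0.25) * 32 = 232.0
width = bw * cell_size              # 64.0
height = bh * cell_size             # 48.0
```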

Yes, we have the information for the 19 by 19 grid cells here; my confusion lies in that what we’re passing into the NMS function is not a (19, 19, 5) tensor but rather just a 1-D tensor with the max scores and a 2-D tensor with the corresponding boxes for those scores, and nowhere do we pass in the grid cells they are relevant to. We removed that extra information in doing the argmax(axis=-1) and reduce_max(axis=-1) operations.
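Concretely, here is a small sketch with the shapes from the assignment (batch dimension omitted, and a made-up 0.6 threshold) showing where the positional information disappears from the tensor shapes:

```python
import tensorflow as tf

# Shapes from the assignment, batch dimension omitted:
# 19 x 19 cells, 5 anchors, 80 classes.
box_scores = tf.random.uniform((19, 19, 5, 80))

box_classes = tf.math.argmax(box_scores, axis=-1)           # (19, 19, 5)
box_class_scores = tf.math.reduce_max(box_scores, axis=-1)  # (19, 19, 5)

# The class axis is gone but the grid/anchor layout survives...
print(box_class_scores.shape)  # (19, 19, 5)

# ...until the boolean mask flattens it: the (row, col, anchor)
# position no longer appears anywhere in the filtered tensor's shape.
mask = box_class_scores >= 0.6  # example threshold
filtered_scores = tf.boolean_mask(box_class_scores, mask)
print(filtered_scores.shape.rank)  # 1
```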

Ok, sorry, it’s been a while since I looked at this assignment and perhaps I’m a bit “out over my skis” here.

I’m in the process of trying to reconstruct my understanding of how this works.

In the meantime, we can also search for some of the history on the forums. These issues have come up quite a few times before. YOLO is by far the most complex algorithm we’ve seen up to this point in DLS and there’s quite a bit to learn and study here.

Here’s a thread about NMS and YOLO.

Here’s a thread about Grid Cells and Anchor Boxes.

No worries.

Unfortunately those links have more to do with a conceptual understanding of YOLO, which I feel I have. What I’m currently lacking is how my intuition of the algorithm maps to the code, given that tf is in many ways still black-box-like.

I did query ChatGPT as to how to ensure the grid cell information was preserved through filtering, and this is the code it outputted. This makes much more sense to me, as it passes the absolute locations of the boxes; I just can’t seem to figure out why the current code in the assignment works without the step below that gathers the grid cell offsets for each bounding box.

import tensorflow as tf

# Suppose you have the following tensors:
scores = [...]          # (N,) confidence score for each bounding box
bounding_boxes = [...]  # (N, 4) box coordinates (x, y, width, height) relative to grid cells
grid_offsets = [...]    # (N, 2) grid cell (col, row) offsets for each bounding box

# Filter out bounding boxes based on a confidence score threshold
threshold = 0.5
filtered_indices = tf.where(scores >= threshold)[:, 0]  # squeeze to 1-D indices
filtered_scores = tf.gather(scores, filtered_indices)
filtered_boxes = tf.gather(bounding_boxes, filtered_indices)
filtered_offsets = tf.gather(grid_offsets, filtered_indices)  # retrieve grid cell offsets for each bounding box

# Now, transform the filtered bounding boxes to absolute image coordinates:
# grid cell offset plus cell-relative center, scaled by the cell size
grid_cell_size = [...]  # size of each grid cell in pixels
absolute_x = (filtered_offsets[:, 0] + filtered_boxes[:, 0]) * grid_cell_size
absolute_y = (filtered_offsets[:, 1] + filtered_boxes[:, 1]) * grid_cell_size
absolute_width = filtered_boxes[:, 2] * grid_cell_size
absolute_height = filtered_boxes[:, 3] * grid_cell_size

# tf.image.non_max_suppression expects corner coordinates [y1, x1, y2, x2],
# so convert from center/size form
absolute_boxes = tf.stack([
    absolute_y - absolute_height / 2.0,
    absolute_x - absolute_width / 2.0,
    absolute_y + absolute_height / 2.0,
    absolute_x + absolute_width / 2.0,
], axis=-1)

# Apply non-maximum suppression (NMS)
selected_indices = tf.image.non_max_suppression(
    absolute_boxes, filtered_scores, max_output_size=100, iou_threshold=0.5)

# Retrieve selected bounding boxes and scores after NMS
selected_boxes = tf.gather(absolute_boxes, selected_indices)
selected_scores = tf.gather(filtered_scores, selected_indices)

# Now, you have the selected bounding boxes and scores after NMS

Note that in the above code, classes aren’t being considered in the IOU step - it seems NMS doesn’t actually do this.

The TensorFlow implementation of NMS is only concerned with pruning possible duplicate predictions. It doesn’t need to know anything about grids or cells because, by the time it is run, it has the complete predicted bounding box coordinates. This is true both in the code used by the DLAI exercise and in what ChatGPT generated above. Inside the NMS function, the predicted bounding boxes are compared pairwise using IOU. If the IOU is high enough, i.e. higher than a configurable threshold, the two boxes are considered duplicates and only the highest-confidence prediction is retained. If the IOU is lower than the threshold, both are kept. In the latter case, either the two boxes are in different locations, or they are mostly in the same location but have different shapes, so they are not considered duplicates.
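As a sketch of that pairwise logic, here is a minimal greedy NMS in NumPy (not the assignment’s or TensorFlow’s actual code; the corner box format and thresholds are illustrative):

```python
import numpy as np

def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2) corners."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping duplicates.
    Note that class plays no role in the decision, only geometry and score."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        # Retain only boxes that do NOT overlap the kept box too much
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) <= iou_threshold])
    return keep

boxes = np.array([[0.0, 0.0, 10.0, 10.0],     # two near-duplicates...
                  [1.0, 1.0, 11.0, 11.0],
                  [50.0, 50.0, 60.0, 60.0]])  # ...and one box elsewhere
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] - the lower-scoring duplicate is suppressed
```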

Notice that class is not needed in order to make this decision. There are a number of prior discussions about whether NMS should be run per-class or not. My position is that it shouldn’t, but not everyone in the forum agrees. Notice that if IOU == 1.0, then the bounding boxes occupy exactly the same pixels of the image, so what additional information would class provide? And remember that class is itself a prediction, with no guarantee that any of them are correct.

If the TF NMS function did include grid cells within it, then there would have to be two versions: one for YOLO and one for everyone else.