The CNN Week 3 YOLO lab assignment contains a lot of contradictory or incorrect information, and it is verbose as well. Here are a few examples:
1) “Anchor boxes are defined only by their width and height.” [My understanding: anchor boxes are defined by size AND position.]
2) “Select only one box when several boxes overlap with each other and detect the same object.” [My understanding: “object” should be replaced with “class.” NMS suppresses overlapping boxes based on predicted class.]
3) Exercise 2 - iou: “This code uses the convention that (0,0) is the top-left corner of an image, (1,0) is the upper-right corner, and (1,1) is the lower-right corner. In other words, the (0,0) origin starts at the top left corner of the image. As x increases, you move to the right. As y increases, you move down.” This CONTRADICTS the later statement: “The top left corner of the intersection (x_i1, y_i1) is found by comparing the top left corners (x_1, y_1) of the two boxes and finding a vertex that has an x-coordinate that is closer to the right, and a y-coordinate that is closer to the bottom.” [y1 should be closer to the top.]
In the absence of further examples, let's explore these one at a time; that might help identify the source of some of the contradictory and incorrect information. For the first assertion:

- In your understanding, is a position a pixel in an image? A 480x480 image contains 230,400 pixels; are there also 230,400 anchor boxes? If not, how many are there, and is it the same for every image?
- How is an anchor box position determined, assigned, and used in the code?
- If I recall correctly, the anchors.txt in the exercise contains 10 numbers read into a Python list, and the code treats these as 5 pairs of 2. Is one value in each pair the position?
- The equations for predicted bounding box shape in the original YOLO papers (and in the code for this exercise) describe and use only two values per anchor box. Is either one of those a position?
The equations show up not only in the research papers but also in several existing threads. Here is one:
Note: the paper refers to them as dimension priors, with p_w and p_h for width and height, respectively. (Spoiler alert: no anchor box position.)
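For reference, these are the bounding box prediction equations from the YOLOv2 paper (and, as best I recall, what the exercise code computes):

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
$$

The position terms $c_x$ and $c_y$ are the offsets of the grid cell that made the prediction; they don't come from the anchor at all. The anchor contributes only the dimension priors $p_w$ and $p_h$.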
For case 1), my theory is that you are really talking about bounding boxes, as opposed to anchor boxes. The bounding boxes are part of the YOLO output and they do include both size and position. The anchor boxes are essentially input to the computation and are used as a way to both a) make the algorithm more efficient and b) organize the output. They are essentially just “aspect ratios” and are not tied to a particular location. There are many excellent posts on the forum from ai_curious which add a lot of context to the material in the course. Here’s a good one to start from on the topic of Anchor Boxes.
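To make that concrete, here is a minimal sketch of how the anchors file gets loaded (the function name is mine and the values are approximate, so treat the details as assumptions rather than the exercise's exact code):

```python
import numpy as np

def read_anchors(anchors_path):
    # anchors.txt holds 10 comma-separated numbers, roughly:
    # 0.57, 0.68, 1.87, 2.06, 3.34, 5.47, 7.88, 3.53, 9.77, 9.17
    with open(anchors_path) as f:
        anchors = [float(x) for x in f.read().split(',')]
    # Reshape into 5 (width, height) pairs. There is no position
    # anywhere in this data -- just the five shape priors.
    return np.array(anchors).reshape(-1, 2)

anchors = read_anchors('anchors.txt')
print(anchors.shape)  # (5, 2)
```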
For item 3), you have two rectangles and you compute the intersection of them. Now think about the upper left corners of both rectangles. If a point is to be in both rectangles (the definition of intersection), then its coordinates must be downwards and to the right of the coordinates of both of those upper left corners, right? That means that the x coordinate must be further to the right and the y coordinate must be closer to the bottom than the corresponding coordinates of both upper left corners. Similarly for the reasoning about the lower right corners: all points in the intersection must be above and to the left of both lower right corners.
Of course the intersection may be trivial, but I believe that the descriptions in the text in the notebook are correct.
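Here is a minimal sketch of the computation the notebook describes, assuming boxes are (x1, y1, x2, y2) corner tuples with the origin at the top left and y increasing downward:

```python
def iou(box1, box2):
    # Each box is (x1, y1, x2, y2): upper-left then lower-right corner.
    (b1_x1, b1_y1, b1_x2, b1_y2) = box1
    (b2_x1, b2_y1, b2_x2, b2_y2) = box2

    # Upper-left corner of the intersection: the rightmost of the two left
    # edges (max of x) and the lowest of the two top edges (max of y, since
    # y grows toward the bottom of the image).
    xi1 = max(b1_x1, b2_x1)
    yi1 = max(b1_y1, b2_y1)
    # Lower-right corner: the leftmost right edge and the highest bottom edge.
    xi2 = min(b1_x2, b2_x2)
    yi2 = min(b1_y2, b2_y2)

    # Clamp to zero so non-overlapping boxes get zero intersection area.
    inter_area = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)

    box1_area = (b1_x2 - b1_x1) * (b1_y2 - b1_y1)
    box2_area = (b2_x2 - b2_x1) * (b2_y2 - b2_y1)
    return inter_area / (box1_area + box2_area - inter_area)

print(iou((1, 1, 3, 3), (2, 2, 4, 4)))  # 1/7, about 0.143
```

The max(..., 0) clamp is what handles the trivial case: if the computed “upper left” corner is not actually above and to the left of the computed “lower right” corner, the boxes don't overlap and the intersection area is zero.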
For point 2), I think that is reasonable terminology. An object is an element of a particular class, so you can refer to it either way. “Object” and “class” are just two ways of saying the same thing when you are talking about the algorithm in words. Of course, when you get to writing the code, you need to be perfectly unambiguous.
Agree with @paulinpaloalto on most of the above, and similarly conclude that the first and third assertions in the original post are incorrect. However, I don't agree with this one. “Class” refers to an abstract concept, a type. “Car” doesn't have a location in an image. It doesn't have a shape. “Car” is not what non-max suppression operates on. Rather, it operates on one or more instances of type Car, that is, Car objects, each of which does have a location and a shape. NMS is implemented to suppress duplicate predictions on the same object: two or more predicted bounding boxes output by the network that overlap significantly enough that they likely contain the same object, of which only one should be retained for further downstream processing.
Note that the implementation of NMS discussed in the original paper and in the early versions of the coding exercise ignored class entirely, as does the implementation in TensorFlow. It is my understanding that NMS is now run multiple times, once for each Class present in the predictions output by the network. Notice that each invocation, however, is still curating a list of specific objects, not lists of Classes of objects.
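For illustration, here is a minimal sketch of both variants built on TensorFlow's class-agnostic tf.image.non_max_suppression op (the box values and variable names are made up for the example):

```python
import tensorflow as tf

# Toy predictions: boxes as (y1, x1, y2, x2) corners, confidence scores,
# and predicted class ids. Boxes 0 and 1 overlap heavily.
boxes = tf.constant([[0.0, 0.0, 1.0, 1.0],
                     [0.0, 0.1, 1.0, 1.1],
                     [2.0, 2.0, 3.0, 3.0]])
scores = tf.constant([0.9, 0.8, 0.7])
classes = tf.constant([0, 0, 1])
num_classes = 2

# Class-agnostic NMS: the op only ever sees boxes and scores, never classes.
selected = tf.image.non_max_suppression(
    boxes, scores, max_output_size=10, iou_threshold=0.5)

# Per-class NMS: run the same op once per class. Each invocation still
# curates a list of specific boxes (object instances), not classes.
kept = []
for c in range(num_classes):
    idx = tf.reshape(tf.where(tf.equal(classes, c)), [-1])
    sel = tf.image.non_max_suppression(
        tf.gather(boxes, idx), tf.gather(scores, idx),
        max_output_size=10, iou_threshold=0.5)
    kept.append(tf.gather(idx, sel))
```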
So my opinion is that assertion 2 is also incorrect.