I have watched the videos relating to anchor boxes several times, yet I can’t really grasp the concept. I understand that anchor boxes have the objective of detecting multiple objects in a single YOLO grid cell. I also understand that the output is [Pc, Bx, By, Bw, Bh, C1, C2, C3, Pc, Bx, By, Bw, Bh, C1, C2, C3] if there are two anchor boxes.
However, what I am confused about is this: why can’t YOLO just output [Pc, Bx, By, Bw, Bh, C1, C2, C3] for every class, and therefore be able to report all the objects in a grid cell?
Any insight would be greatly appreciated
Let me rephrase the objective of YOLO first. YOLO is an object detection application.
So, the expected output is a combination of a bounding box (a rectangle around an object) and an object type.
There are multiple steps to generate the final outputs, which are multiple bounding boxes with an object type for each.
The convolutional network creates feature maps. A feature map includes multiple candidates based on anchor boxes. Here, one object may be included in multiple anchor boxes, and the object type (class) is not finalized yet. This is what you wrote as the YOLO output: b_x, b_y, b_w, b_h are the center and size/shape of a candidate box, and c_1, c_2, c_3, ..., c_n are the probabilities for each type that may be in this particular anchor box. For example, say c_1 represents “person”, c_2 represents “bicycle”, c_3 represents “car”, and so on, and the outputs are c_1 = 0.1, c_2 = 0.01, c_3 = 0.8, c_4 = 0. Then this anchor box is most likely representing a “car”. But, again, nothing is finalized yet, since one object may be included in multiple anchor boxes.
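To make that concrete, here is a minimal sketch (in Python, with made-up numbers) of how one anchor’s slice of the output could be read:

```python
import numpy as np

# Made-up numbers for one anchor's slice of the YOLO output vector.
p_c = 0.9                                      # object presence score (Pc)
b_x, b_y, b_w, b_h = 0.5, 0.4, 0.3, 0.2        # box center and shape (not used below)
class_probs = np.array([0.1, 0.01, 0.8, 0.0])  # c_1..c_4, e.g. person, bicycle, car, ...

best_class = int(np.argmax(class_probs))  # index 2 -> "car"
score = p_c * class_probs[best_class]     # confidence used later for filtering/NMS
print(best_class, score)                  # 2 0.72 (approximately)
```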
Then, the YOLO head starts classification (what type it is) and localization (where it is). One object, say a “car”, may be included in multiple anchor boxes. By applying Non-Maximum Suppression, YOLO discards overlapping, lower-confidence boxes and finally keeps “one bounding box” for “one object (class)”.
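For reference, Non-Maximum Suppression itself is simple enough to sketch. This is a generic version, not YOLO’s exact code; boxes are assumed to be (x1, y1, x2, y2) corners, each with a score:

```python
import numpy as np

def iou(box_a, box_b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2) corners.
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    # Keep the highest-scoring box, drop any remaining box that overlaps
    # it too much, then repeat with what's left.
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order[0]
        keep.append(best)
        order = [i for i in order[1:] if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```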
So, the final outputs from YOLO are multiple bounding boxes, with a single class per bounding box.
Hope this clarifies what YOLO is doing. If you have further questions, please feel free to post them.
Here’s how I think of it. Suppose we want to do classification when there is one object in an image. You can run a CNN forward prop and easily generate a prediction, right? Cat. But how do you deal with images containing two objects when the network only produces a single output? The initial approach was to divide the input image and run the same classification network on all the parts. If there is one object in each subdivided part, then we’re good. Except now you are doing lots more computation, and some of those regions may still contain multiple objects. YOLO was a reaction to this challenge: how to deal with multiple objects, possibly near each other, and still run in near real time. By introducing grid cells (the number of grid cells is S in the YOLO papers) and anchor boxes, B, a YOLO CNN can output S*S*B classification predictions from a single forward pass. You kind of get the best of all worlds: high enough accuracy, even on multi-object images, at a very high frame rate. When it was introduced circa 2016, YOLO was competitive in accuracy with state-of-the-art region-based approaches but was substantially faster, which is why it is still studied 6 years later. Hope this helps.
Hi, and thank you for your response. Unfortunately, I am still confused about the use of anchor boxes, as instead of using anchor boxes it would be much better (in my opinion) if the output showed a bounding box for each class.
Please give me insight into anchor boxes, thanks
Maybe you can elaborate on why your design approach would be better. More accurate? Runs faster? Both? Etc
There are two contributions that anchor boxes make to the effectiveness of YOLO.
First is that anchor boxes provide reasonable baselines for bounding box shape/size predictions. As stated in the papers, the predicted bounding box width and height are related to anchor box shapes as b_w = p_w * e^{t_w} and b_h = p_h * e^{t_h}, where t_w and t_h are outputs of the neural network and (p_w, p_h) is the shape of an anchor box. As detailed in other threads, the anchor box shapes (p_w, p_h) are based on the shapes of objects in the training data, not on the classes/types of those objects.
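If it helps, here is a tiny sketch of that decode step. The function name is mine, but the formulas are the ones quoted above from the YOLO papers:

```python
import numpy as np

def decode_shape(t_w, t_h, p_w, p_h):
    # b_w = p_w * e^{t_w}, b_h = p_h * e^{t_h}: the network outputs
    # (t_w, t_h) scale the anchor shape (p_w, p_h) up or down.
    b_w = p_w * np.exp(t_w)
    b_h = p_h * np.exp(t_h)
    return b_w, b_h

# With t_w = t_h = 0 the prediction falls back to the anchor shape itself,
# which is exactly why the anchors act as reasonable baselines.
print(decode_shape(0.0, 0.0, p_w=1.5, p_h=2.0))   # (1.5, 2.0)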
Second is that they help the network make multiple object detection predictions on each input image. As mentioned above and in the self-driving car programming exercise, the YOLO network makes S*S*B detections on each image from each forward pass, and each detection is (1 + 4 + C) floating point numbers: the 1 is the object presence prediction, the 4 are the bounding box center location and shape predictions, and the C are the class probabilities.
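Plugging in the numbers from the course assignment (S = 19 grid cells per side is my recollection of the exercise, with B = 5 and C = 80) gives a feel for the output size:

```python
# Output sizing with the course numbers: S = 19 grid cells per side,
# B = 5 anchor boxes, C = 80 classes.
S, B, C = 19, 5, 80
detections = S * S * B             # 1805 detections per forward pass
per_detection = 1 + 4 + C          # presence + box (x, y, w, h) + class probabilities
print((S, S, B, per_detection))    # output tensor shape: (19, 19, 5, 85)
print(detections * per_detection)  # 153425 floats in total
```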
Doing away with anchor boxes entirely would remove the positive impact they have on convergence of the bounding box shape training, as you would start from random initialization instead of from the common training data shapes. Further, if you changed from the number of anchor boxes used (generally an integer less than 10) to the number of classes, it would substantially increase the number of predictions you would have to make from each forward pass. By that I mean the course uses B = 5 and C = 80; if you did a full detection for each class, the network output size increases by 16x. But it is common to train on 1,000 types, and ImageNet actually contains some 20,000 of what they call categories, which correspond to what we call classes. The original YOLO took a week to train, so I’m not sure how practical it would be if B = 20,000.
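Back-of-the-envelope, here is where the 16x comes from:

```python
# The hypothetical "one detection per class" design versus anchor boxes.
S, C = 19, 80
per_detection = 1 + 4 + C                  # 85 floats per detection
with_anchors = S * S * 5 * per_detection   # B = 5  ->   153,425 floats
one_per_class = S * S * C * per_detection  # B = 80 -> 2,454,800 floats
print(one_per_class / with_anchors)        # 16.0
```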
Finally, the bounding box center location and shape predictions are not dependent on the class prediction. They are entirely separate outputs of the network. So how and why would the network produce different bounding box center location and shape predictions for a given object? Wouldn’t they be the same for all possible classes? The features extracted from a certain region of the image suggest where an object is regardless of what it might be.
Sorry, but I don’t yet see the upside of this proposed design.
I am currently learning about YOLO and I have a question: why are anchor boxes needed if you already have the dimensions of objects from b_x, b_y (midpoint) and b_h, b_w?
This slide is talking about using anchor boxes when two objects show up in the same grid cell. It shows that you compare the dimensions of an object to two different anchor boxes and pick the one with the higher IOU. My question is: why do you need the anchor boxes if you have the dimensions of the object in the first place? Why don’t you just make a custom bounding box for that object with these dimensions, b_x, b_y (midpoint) and b_h, b_w? Sorry if this is an elementary question; I just learned about the YOLO algorithm.
I’m not sure I completely understand the question. But I think this slide and discussion are in the context of training. During training you do have all object shape(s) and center location(s). You use them in the loss computation to train the model on which grid cell and anchor box are correct for the ground truth objects. You then use the learned parameters to make better predictions at operational runtime, when you don’t know the shape or center location. Does this make sense?
Here are a couple of related threads, but you can find many others using the search…
Yes, it seems a bit clearer to me now. In this slide, are anchor boxes 1 and 2 predetermined shapes that are assigned to a specific class (for example, every car gets this rectangle shape), or are they shapes that the network learns from the correct training examples via the parameters (b_x, b_y, b_h, b_w)?
Anchor box shapes are predetermined, but they are not class-specific. Think it through: cars close up and cars far away are not the same size. Also, some models have been trained on data with thousands, even tens of thousands, of classes. Since the number of anchor boxes determines the YOLO network output shape, and thus the number of predictions (and computations), you can maybe have 10 anchor boxes, but you cannot have 10,000.
The anchor box shapes are learned from the training data, but not by the YOLO network, and not including location (b_x, b_y). Rather, the shapes are selected by running unsupervised learning (k-means clustering in the YOLO papers) on the ground truth shapes only…location is irrelevant. This has also been previously discussed extensively in the forum.
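For the curious, the YOLOv2 paper does this clustering over ground truth (w, h) pairs using 1 - IOU as the distance. A rough sketch of that idea (my own simplified version, not the paper’s code):

```python
import numpy as np

def shape_iou(wh_a, wh_b):
    # IOU of two boxes compared by shape only (both anchored at the origin).
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

def kmeans_anchors(gt_shapes, k, iters=50, seed=0):
    # gt_shapes: float array of (w, h) pairs from the ground truth boxes.
    rng = np.random.default_rng(seed)
    centroids = gt_shapes[rng.choice(len(gt_shapes), k, replace=False)]
    for _ in range(iters):
        # Assign each shape to the centroid with the highest IOU
        # (equivalently, the smallest 1 - IOU distance).
        assign = np.array([
            np.argmax([shape_iou(s, c) for c in centroids]) for s in gt_shapes
        ])
        for j in range(k):
            members = gt_shapes[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

# Example usage with fake ground truth shapes:
fake = np.random.default_rng(1).uniform(0.05, 0.9, size=(500, 2))
print(kmeans_anchors(fake, k=5))
```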
@ai_curious So in this situation, if the rectangles were applied incorrectly to the wrong objects, would the two vectors (the output vector from the network and the correct vector from the training example) be compared, and IOU done with b_x, b_y, b_h, b_w to adjust them?
Again, not sure I completely follow the idea in the question, so I’ll just make some assertions and see if that helps.
During establishment of training data there is no ‘incorrectly’. Each ground truth object bounding box is centered in exactly one grid cell. Each ground truth object bounding box has the highest IOU with exactly one anchor box. That grid cell + anchor box tuple is assigned the ground truth object, and that object is not assigned to any other grid cell + anchor box location. If you assign those locations incorrectly, then you are training your model to learn to detect objects in the wrong places…don’t do that. Don’t proceed to training or using the trained YOLO model until that part is correct. Garbage in, garbage out, right?
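In other words, the training data setup is deterministic. Here’s a sketch of how one ground truth box could be assigned to its unique grid cell + anchor slot (names and numbers are mine, and coordinates are assumed normalized to [0, 1)):

```python
import numpy as np

def shape_iou(wh_a, wh_b):
    # IOU of two boxes compared by shape only (both anchored at the origin).
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    return inter / (wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter)

def assign_ground_truth(gt_box, anchors, S):
    # gt_box: (x, y, w, h) with center (x, y) in [0, 1) image coordinates.
    x, y, w, h = gt_box
    col, row = int(x * S), int(y * S)   # the one grid cell containing the center
    best_anchor = int(np.argmax([shape_iou((w, h), a) for a in anchors]))
    return row, col, best_anchor        # the unique (cell, anchor) slot for this object

anchors = [(0.05, 0.08), (0.2, 0.15), (0.4, 0.5)]  # made-up anchor shapes
print(assign_ground_truth((0.52, 0.31, 0.22, 0.14), anchors, S=19))  # (5, 9, 1)
```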
Anchor box shapes are never adjusted. Once their shapes are determined during exploratory data analysis, you just use them until you decide you need a different set of anchor boxes. In that case, you replace them all and start the entire training process over again.
During training, anchor box shapes are used as part of the prediction of bounding box shape (again, the relationship between anchor box shape and predicted bounding box shape is discussed at length in existing threads, so no sense writing it all again here). The loss function compares the ground truth bounding box shape with the predicted bounding box shape and iteratively adjusts the values in the weights matrices to minimize total error. IOU between anchor boxes and ground truth is not directly calculated or used in the loss computation during training iterations. The predicted bounding box center location and shape are never adjusted directly; rather, the weights that lead to those values being generated are modified, and the forward pass producing the predicted outputs is run again.
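As a toy illustration of that last point, the loss only ever compares shapes and pushes gradients into the weights. Something in the spirit of the squared-error shape terms from the original YOLO paper (not the full loss function):

```python
import numpy as np

def shape_loss(pred_wh, true_wh):
    # Toy version of the width/height terms: the YOLO paper penalizes
    # square roots of w and h so that an error on a small box weighs
    # more than the same absolute error on a large box.
    pred_w, pred_h = pred_wh
    true_w, true_h = true_wh
    return (np.sqrt(pred_w) - np.sqrt(true_w)) ** 2 + \
           (np.sqrt(pred_h) - np.sqrt(true_h)) ** 2

# Compare a predicted shape against the ground truth shape; gradients of
# this value flow back into the network weights, never into the anchors.
print(shape_loss((0.30, 0.20), (0.25, 0.22)))
```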
You answered my question, thank you! I was referring to the model predicting the wrong boxes during training of the YOLO algorithm, not during establishing the training data.