How does YOLO work when "live"?

I do understand how YOLO is trained, but I do not really understand how it operates when you are actually out driving on the street. When training, we use bounding boxes and anchor boxes, where each anchor box has its own position in the y-vector, and each such position looks like [p,x,y,b,h,“class-vector”]. But for “live” data we do not have the bounding box. Are the features extracted by the ConvNet used to make the algorithm “look” for which anchor the object most likely belongs to, and then which class within that anchor the object most likely belongs to? Or do I misunderstand?

And then just a thought on softmax vs sigmoid. Since there will only be one correct class within each anchor box (right?), softmax should be used. But would the performance be much worse using sigmoid instead? Or would that cause problems when the algorithm is running on unseen data?


One thing I find helpful when talking about how YOLO works is to always include the qualifiers ground truth and predicted rather than just saying bounding box, which is ambiguous. At training time both ground truth bounding boxes and predicted bounding boxes exist, and the error between them is what drives parameter learning. At operational runtime, only predicted bounding boxes exist. There is no usage where no bounding boxes are present.

Here’s why I say that. Forward propagation in YOLO, in both training and operational use, simultaneously produces S*S*B sets of predictions. If S=19 and B=5, that means 1,805 sets of predictions. All 1,805 are produced every forward pass. No exceptions. Each set, as you note, is comprised of (1+4+C) floating point values. The 4 includes the predicted center coordinates, b_x, b_y , and the predicted bounding box shape, b_w, b_h. So it is imprecise to say ‘for “live” data we don’t have the bounding box.’ In fact, for ‘live’ data there are 1,805 bounding boxes…the predicted bounding boxes. Post CNN processing using thresholds and non-max suppression may reduce that number, but they all existed as CNN output. YOLO produces predictions for all (S*S*B) detectors and all C classes at the same time, and just uses the predicted probabilities (or confidence) to rank and choose from among them.
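To make the counting concrete, here is a minimal NumPy sketch. The random array only stands in for real network output, and C = 80, the 0.6 threshold, and all variable names are assumptions for illustration, not the course's code. Non-max suppression is left out.

```python
import numpy as np

# S and B come from the example above; C = 80 is an assumed class count.
S, B, C = 19, 5, 80

rng = np.random.default_rng(0)
# Stand-in for the CNN output: one (1 + 4 + C) vector per grid cell and anchor.
output = rng.random((S, S, B, 1 + 4 + C))

num_predictions = S * S * B            # every forward pass produces all of these

p_obj = output[..., 0]                 # predicted objectness, shape (S, S, B)
boxes = output[..., 1:5]               # predicted b_x, b_y, b_w, b_h
class_probs = output[..., 5:]          # predicted class scores, shape (S, S, B, C)

# Score every class at every output location, then keep the best class per location.
scores = p_obj[..., None] * class_probs
best_class = scores.argmax(axis=-1)
best_score = scores.max(axis=-1)

threshold = 0.6                        # hypothetical score threshold
keep = best_score >= threshold
kept_boxes = boxes[keep]               # the predicted boxes that survive the threshold

print(num_predictions, "predicted boxes,", kept_boxes.shape[0], "kept after thresholding")
```

The point is that thresholding only selects among predicted boxes that already exist in the output. Nothing is created or skipped during the forward pass.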

Hope this helps. I’ll see if someone else wants to reply to the other parts of the question.

@ai_curious Thanks, yeah that really helps. I think I got a little bit lost in my way of thinking about the input. Of course the bounding boxes only exist as numbers in the y-vector, and they are not actually a part of the input picture. I really appreciate your answer!

Let’s see if someone else would like to address the second part of my question :slight_smile:

Hi @G11,

For the second part of the question: yes, sigmoid can be used. For each class, you would then have the probability of that class being present.
But since each box has only one class, you are only interested in the highest class score, so using softmax is better.
Also, the logits you obtain when you train with softmax vs sigmoid are different. So in case you train with softmax, it would be better not to use sigmoid during inference.

Since there are many YOLO derivatives, it is difficult to talk about “the implementation”. But I think the essentials are covered by @ai_curious.

For a better understanding of “what is object recognition?”, it may help to start from a bird’s-eye view.
Object recognition is a general term that includes three tasks.

  • Localization : identifies a location of an object with a bounding box
  • Classification : predicts the type or class of an object
  • Object Detection : includes localization and classification for all objects in an image

YOLO performs localization and classification at the same time by setting an appropriate loss function for each. So, at training time, as described above, YOLO needs a set of images and annotations that specify 1) the class of each object in an image, and 2) its bounding box. The outputs of the network are then 1) localization information (the bounding box), and 2) classification information (car, traffic light, …). One important addition is a “confidence” value that indicates how well a bounding box covers a target object; it is used later for box suppression.

In the first two releases, YOLO used Softmax for classifications.

YOLO v1 :You Only Look Once: Unified, Real-Time Object Detection
YOLO v2 : YOLO9000: Better, Faster, Stronger

But this changed in v3, for the following reasons. Both are quoted from the YOLO v3 paper, YOLOv3: An Incremental Improvement:

  1. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers.
  2. Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.
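A small numeric sketch of the difference, in NumPy. The logits and the three overlapping labels are made up for illustration; this is not taken from any YOLO codebase.

```python
import numpy as np

def softmax(z):
    """Mutually exclusive classes: outputs are positive and sum to 1."""
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    """Independent logistic classifiers: each output is its own probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logits for three labels, e.g. "person", "woman", "car".
logits = np.array([2.0, 1.5, -3.0])

p_soft = softmax(logits)
p_sig = sigmoid(logits)

print("softmax:", np.round(p_soft, 3))   # the classes compete for a total of 1.0
print("sigmoid:", np.round(p_sig, 3))    # two labels can both be likely at once
```

With softmax, a high score for one label necessarily pushes the others down. With independent sigmoids, two overlapping labels can both score above 0.5, which is the multilabel behaviour the quote describes. The top-ranked class is the same either way, since both functions are monotonic in the logits.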

So you actually raised good points. :slight_smile:
Since there are many derivatives, some may still use softmax, but the YOLO concept is as above.
I think that covers the second part of your question.
Regarding “unseen data”: anything outside the class list is simply background. YOLO is trained on a given set of class definitions and image annotations, so an object that is not in the class list is treated as background.

Hope this clarifies.

@anon57530071 Thanks a lot for your answer. Really interesting to hear about the updates in the YOLO versions. Then I only have one single question left, and I will move on with the next course: Do I have to have the same classes in all anchor boxes? Say that I have two anchor boxes: one being a rectangle with a greater width than height, and one being higher than wide. I.e, the first one suitable for cars, trucks and so on. And the other one suitable for humans, traffic lights, space rockets (why not?). Can I only have class “car” and “truck” for the first anchor box, and class “human”, “traffic light” and “space rocket” for the second? Or do both boxes need all five classes? With different “class-vectors” for each anchor box, the code would be more compact, right?

Anchor boxes are independent of class definitions.

In v3, there are three output layers of different sizes, and three anchors are prepared for each. Here is the shape of the anchor boxes.

They are defined by running K-means on the bounding boxes of a target dataset (such as the COCO dataset). For this reason, it is recommended to adjust the anchor sizes to capture the characteristics of your own target dataset.
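As a rough sketch of that clustering step: the code below runs K-means on box widths and heights, assigning each box to the anchor with the highest IoU. The box sizes are synthetic random numbers, and the function names `iou_wh` and `kmeans_anchors` are my own, so treat it as an illustration of the idea rather than the procedure used for any released anchor set.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (N, 2) boxes and (K, 2) anchors, compared by width/height only."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_a = anchors[:, 0] * anchors[:, 1]
    return inter / (area_b[:, None] + area_a[None, :] - inter)

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (width, height) pairs; each box joins the anchor it overlaps most."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = iou_wh(boxes, anchors).argmax(axis=1)
        # Move each anchor to the mean of its boxes; keep it if its cluster is empty.
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else anchors[j] for j in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sorted by area

# Synthetic (width, height) pairs in pixels, standing in for real annotations.
boxes = np.random.default_rng(42).uniform(10, 300, size=(500, 2))
anchors = kmeans_anchors(boxes, k=9)
print(np.round(anchors, 1))
```

Nothing in this procedure looks at class labels, only at box shapes, which is why anchors end up independent of the class definitions.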

As you can see, YOLO prepares 9 prior anchor boxes with different sizes and aspect ratios. With these, the majority of objects in the COCO dataset (car, aeroplane, bicycle, …) should be covered.

While writing this, I recalled another reason why YOLO (and some others) does not use softmax: hierarchical classification. Simple classification assumes a flat definition of classes, but sometimes labels come from a hierarchically structured database such as WordNet. An example is the relationship dog -> terrier -> Yorkshire terrier… Researchers thought that selecting a single class with softmax is not very flexible.

To the best of my knowledge there are 6 “versions” of YOLO.

The original, circa 2015/2016 is described here: [1506.02640] You Only Look Once: Unified, Real-Time Object Detection

The next two, which are v2 and YOLO 9000 are described in the same paper : [1612.08242] YOLO9000: Better, Faster, Stronger

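It was YOLO 9000 that introduced hierarchical (not mutually exclusive) classification, and thus switched away from a single flat softmax over all classes: it applies separate softmaxes over groups of sibling labels in its WordTree hierarchy.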

The last paper from the original team is known as v3 and is described here: [1804.02767] YOLOv3: An Incremental Improvement

After v3, lead researcher Joseph Redmon stopped working on object detection because of the uses he saw being made of it. Subsequently there have been two additional releases by other people.

V4 : [2004.10934] YOLOv4: Optimal Speed and Accuracy of Object Detection.

I’m not aware of a published paper for v5. There are some interweb locations that refer to a ‘new’ or ‘modified’ or ‘improved’ v5 but nothing I have seen claims to be a new “version.”

The lectures and code in this class are based on v2. It’s important to provide that context whenever asking questions about or discussing YOLO, because there are some substantial differences between the versions.

Regarding the question on anchor boxes and classes, in YOLO v2 anchor boxes are indeed class independent. The number to use, and their sizes, are determined by exploratory data analysis on the training set. Picking good anchors influences training effectiveness and efficiency because the anchor shapes influence the predicted bounding box shapes, which in turn influence the loss calculations. Bad anchors lead to bad predictions both during training and at runtime. I wrote up some of my experience generating anchors for a custom dataset here: Deriving YOLO anchor boxes, which may help. Cheers


Thanks @ai_curious, but just to clarify, when you say anchor boxes are class independent: do I have the same class vector for each anchor, or can I have car and truck in one anchor, and human, traffic light and space rocket in another? In the lecture by Prof. Ng it seems to be the very same class vector for all anchors.

For v2, which is what the code in this class is based on, every grid cell + anchor box tuple (let’s call that an output location) in the network output has a vector of predicted class probabilities. The vector is the same shape for every output location, and the order of the classes within it is the same for every output location. For example, car is element 0, motorcycle is element 1, person is 2, toaster oven is 3, or whatever. Somewhere there is a dictionary that maps these indices to their class names for display on the output.

What differs across output locations is only the floating point values of the predictions in each of those class vector positions. Maybe for one output location the features strongly suggest a car, so position 0 will be high, close to 1.0, and the other 79 (with 80 classes) will all be near 0. But if the features suggest a person, it is position 2 that will be high. Maybe the object is not recognized, and all the probabilities will have a low value (maybe 1. / 80.). Etc. It is never the case that you have different shaped class probability vectors for the different output locations, or that the order of classes they represent differs. Hope this helps.
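Here is a tiny sketch of that shared index-to-name mapping, using a four-class list for brevity. The `CLASS_NAMES` dictionary, the `label_at` helper, and the hand-written probabilities are all hypothetical, just to show that one mapping decodes every output location.

```python
import numpy as np

# One shared mapping from class index to name, used at every output location.
CLASS_NAMES = {0: "car", 1: "motorcycle", 2: "person", 3: "toaster oven"}

S, B, C = 3, 2, len(CLASS_NAMES)       # small grid and anchor count for illustration

rng = np.random.default_rng(1)
# Stand-in class probability vectors: same shape (C,) at every grid cell and anchor.
class_probs = rng.random((S, S, B, C))
class_probs /= class_probs.sum(axis=-1, keepdims=True)

# Hand-set two locations to mimic confident predictions.
class_probs[1, 1, 0] = [0.90, 0.03, 0.04, 0.03]   # features suggest a car
class_probs[0, 2, 1] = [0.05, 0.05, 0.85, 0.05]   # features suggest a person

best_index = class_probs.argmax(axis=-1)          # shape (S, S, B)

def label_at(row, col, anchor):
    """Decode the most likely class name at one output location."""
    return CLASS_NAMES[int(best_index[row, col, anchor])]

print(label_at(1, 1, 0), "|", label_at(0, 2, 1))
```

Because the vector shape and class order are identical everywhere, a single `argmax` over the last axis and a single dictionary are enough to label all locations.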


You should also think about this question in terms of the training data and the loss function. Suppose you have a 3x3 grid, and each grid cell has 2 anchor boxes: 18 output locations in all. We’ll use the 4 class universe I mention above. Now suppose one input image has a toaster oven right in the center and just background everywhere else. 17 of the locations will have all training data values 0.

For the center grid cell, whichever anchor box shape has the highest IoU with the ground truth bounding box will have non-zero values. P_c = 1.0. b_x and b_y will be 0.5, because the object is in the center of its grid cell and, in the convention used in the lectures, these coordinates are measured from the cell’s top-left corner on a 0-to-1 scale. b_w and b_h will be proportional to the ratio of the ground truth bounding box to the size of the grid cell. The class probability vector will be [0.0 0.0 0.0 1.0], reflecting that the object is a toaster oven.

Later, during training time, softmax will result in a floating point value for each c_i. Maybe early in training it is [.3 .05 .05 .6]. The loss function will compare these predictions to the ground truth. This is why the class prediction vector must be the same shape, and the indices within the vector must represent the same classes, for each output location. After a few epochs we’d want the car prediction to go down and the toaster prediction to go up: [0.1 0.05 0.05 0.8]. We need to be able to know that the highest prediction, which is in class vector location 3, means toaster oven. Does this make sense?
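For anyone who wants to see that training target as an array, here is a minimal sketch. It assumes box centers are stored as offsets from the cell's top-left corner on a 0-to-1 scale and box sizes in grid-cell units; other encodings exist and would give different numbers. The anchor shapes, the box size, and the `encode` helper are all made up for illustration.

```python
import numpy as np

S, B, C = 3, 2, 4                      # 3x3 grid, 2 anchors, 4 classes
TOASTER = 3                            # class index for "toaster oven"

# Hypothetical anchor shapes (width, height) in grid-cell units: one wide, one tall.
anchors = np.array([[2.0, 1.0], [1.0, 2.0]])

def encode(box, class_id):
    """Build the (S, S, B, 5 + C) target for one ground truth box.

    box = (cx, cy, w, h), all normalized to the image size (0 to 1).
    """
    target = np.zeros((S, S, B, 5 + C))
    cx, cy, w, h = box
    col, row = int(cx * S), int(cy * S)            # grid cell containing the center
    wh = np.array([w * S, h * S])                  # box size in grid-cell units
    # Pick the anchor whose shape has the highest IoU with the ground truth shape.
    inter = np.minimum(wh, anchors).prod(axis=1)
    iou = inter / (wh.prod() + anchors.prod(axis=1) - inter)
    a = int(iou.argmax())
    target[row, col, a, 0] = 1.0                                # P_c
    target[row, col, a, 1:3] = [cx * S - col, cy * S - row]     # b_x, b_y within the cell
    target[row, col, a, 3:5] = wh                               # b_w, b_h
    target[row, col, a, 5 + class_id] = 1.0                     # one-hot class vector
    return target, (row, col, a)

# A toaster oven centered in the image, taller than it is wide.
target, (row, col, a) = encode((0.5, 0.5, 0.2, 0.4), TOASTER)
print("location:", (row, col, a))
print("target vector:", target[row, col, a])
```

Only one of the 18 output locations is non-zero, and its class part is the one-hot vector with the same class ordering used everywhere else, which is what lets the loss compare predictions against ground truth element by element.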
