Week 3 - Assignment 1 - Computation of Class Score: Why multiply Pc with C?

Dear all,

I fail to understand the computation of the class score that is described above and below Figure 4 in the assignment:

The class score is score_c,i=p_c × c_i: the probability that there is an object pc times the probability that the object is a certain class ci.

I always thought having pc was redundant when there is already ci which indicates the probability of a certain class in the image. If they all are zero or below a certain threshold then the classifier wouldn’t output anything.

So,

  1. What’s the point of having pc?
  2. Why do we have to multiply pc by ci instead of just taking the maximum of the values in ci vector which are above a certain threshold?

Thank you very much in advance,

c_i identifies an object, i.e., what it is. As you see, there are 80 classes, which include “car”, “bus”, “traffic light”, and so on, and each value represents the probability of that class. For example, car=0.8, bus=0.1, traffic_light=0.1. So, the object in the box is most likely a “car”. But that does not mean a “car” is really there. That probability is represented by p_c. So, if p_c is small, say 0.01, then even if one of the classes (c_i) has a high value, say 0.9, there is most likely nothing there.
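
To make the arithmetic concrete, here is a minimal sketch using the numbers from this reply. The 0.6 cutoff and the helper name class_scores are assumptions for illustration only; they are not stated in the post.

```python
# Class probabilities for one box (values from the example above).
class_probs = {"car": 0.8, "bus": 0.1, "traffic_light": 0.1}

def class_scores(p_c, probs):
    """Weight each class probability c_i by the objectness p_c."""
    return {name: p_c * c for name, c in probs.items()}

threshold = 0.6  # assumed cutoff, for illustration only

confident = class_scores(0.9, class_probs)   # an object is very likely present
doubtful = class_scores(0.01, class_probs)   # almost certainly nothing there

best = max(confident, key=confident.get)
print(best, confident[best] >= threshold)     # car True
print(max(doubtful.values()) >= threshold)    # False
```

The class vector is identical in both cases; only p_c decides whether the box survives the cutoff.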

@anon57530071 provides some good insight, but there are a few more tidbits that might be helpful.

  1. On “If they all are zero”: YOLO used softmax activation at the output layer (at least prior to YOLO v3), so the class predictions would never all be zero (they have to sum to 1).

  2. If you read the early papers, especially the original 2015 YOLO paper, you see:
    If no pred object exists in that cell, the confidence scores should be zero. And then we multiply the conditional class probabilities and the individual box confidence predictions … which gives us class-specific confidence scores for each box.

By multiplying the c_i and P_c you are in effect weighting the class prediction by the likelihood that some object (of any class) is there. If the prediction is that likely no object is present (low P_c), then you don’t care what class is predicted. The net effect of these predictions on the loss computation, and thus the learning, is low. You want the network learning only (or at least mostly) from the true object locations, which would not be the case using c_i alone. Let us know if it’s clearer now?

Agree with your response with the exception that I might have chosen to write classify there. Predicting ‘what it is’, yes, but more precisely what type it is.

Dear @ai_curious,
Thank you very much for your kind and detailed response.
Your explanation was amazing and it couldn’t have been done any better; however, I still cannot wrap my head around having a separate output variable pc.
Your first point is absolutely right: YOLO used softmax activation at the output layer, so the class predictions would never all be zero (they have to sum to 1). Then why isn’t there another class label for “background” (a.k.a. no object) instead? This way all the class predictions could still add up to 1.

I don’t know if there is a historical reason for the common practice of using c and not c + 1 for the range of outcomes. There is nothing preventing one from doing so, so maybe it’s just a pragmatic question of whether it consistently makes the model better in any way, or whether you can just achieve the same result through threshold on the conditional probabilities. That is, regardless of whether the class prediction is high but the object presence prediction low or vice versa, you still end up with a low confidence prediction that will likely be filtered or ignored.

Perhaps the difference is as simple as this. For the non-background classes you train on positive examples: this is a dog, this is a cat, this is a motorcycle. But if you then provide an image that does not contain one of those, what features can be learned to classify “background” ? Is an image filled with sky background? Field of wheat? Ocean? What about an image of a table? Seems hard to learn a generalizable set of features of “background”. But straightforward to say the confidence is low that it is any one of the c classes in the training set.

That’s a very good reasoning.
Maybe I should’ve said “neither of those 80 classes” instead of background, meaning that the network doesn’t learn anything on background features, instead it would learn what a cat is, what a car is, etc, and everything else would fall under “neither of those 80 classes” class.

I don’t think that solves the problem of how to learn features during training. Say you have 3 classes: car, cat, and other. Then a training image of a motorcycle is labelled “other”. As is one of a loaf of bread. And a clock… I don’t believe that results in learned parameters that will yield a confident prediction of “other” at run time. You could run experiments to measure or, if mathematically inclined, try to reason through the implications. Maybe one of the more math literate mentors could help you. @paulinpaloalto might know, or know who would.

There are a couple of challenges with this concept.

First, as others have mentioned, the training set for “other” is infinitely large. You can’t practically train a system to detect everything that isn’t one of the other classes.

And if you impose an artificial threshold (below which you say it’s the “other” class), then you give up the ability to make a useful prediction in cases when all of the examples are below that threshold - even if one of them could be clearly a more confident choice than the others.

The important step for object detection is to set a bounding box. YOLO performs two processes, i.e., object detection and classification, simultaneously. Then it creates a bounding box and sets a probability. There should be no empty bounding box since it is a “Posterior Probability”.

[addition to clarify the last sentence]
The last sentence is misleading… There can be an empty bounding box. If a “posterior probability” is set for any of the classes, then there is something in the bounding box. That is what I meant… Sorry for confusing everyone.

Thank you so very much for your amazing response.

Ok, I am taking it back, what I said about threshold. Wouldn’t having a softmax for 80 (all categories)+1(neither of 80 categories) classes work at all?

I was slightly lost with

If no pred object exists in that cell, the confidence scores should be zero.

I believe the original paper says “If no object exists in that cell, the confidence scores should be zero.” But that might be OK.

I think I should quickly explain the role of P_c. It is not just a parameter indicating whether an object is there or not. (In this sense, it cannot be replaced by having an “other” class.) It is an IOU describing how much this bounding box covers a ground truth object, if an object exists. This is quite important, since eventually we need to select/merge bounding boxes to represent objects.
For example, if a target is “car”, but this bounding box only covers small percentage, the score should be lower. That weighting is done by IOU.
I think some confusion comes from what YOLO trains. It is not cat/dog types of classifier. It is a kind of bounding box identifier. Actually, training data is quite unique. It has, of course, a label like “car”. In addition, we need to add some additional data about a car in an image. That’s bounding box information to fully represent an object, “car”, in this case.
With these data, YOLO can detect an object with bounding box information (and how much it covers an object). In this sense, I should say, YOLO is not a classifier, but is a bounding box detector for specific objects.

I suppose the above covers your original question.

  1. What’s the point of having pc?
  2. Why do we have to multiply pc by ci instead of just taking the maximum of the values in ci vector which are above a certain threshold?

In short,

  1. P_c includes information about the coverage of bounding box.
  2. Even if C_i value is high, if the bounding box does not cover an object well (like only small part of an object), the score as a bounding box is low.

How do you propose to assign the predicted value for the “neither of 80 categories” category?

Exactly. The reason that doesn’t work is what does the training set look like for “none of the above”? This is the point that @ai_curious also made earlier in this thread. The way you train a softmax classifier is that you need to have labelled samples for all possible classes: there is no separate mechanism for handling “none of the above”.

Maybe it would be worth discussing how YOLO goes about learning to assign the P_c value. What does the training data look like for the case of a low P_c? I’ve never looked at any of the YOLO papers, but I know some folks who have :nerd_face:

@paulinpaloalto
I suppose I covered those points. I think a “none of 80 categories” class may make the system performance worse.

Let’s start with what P_c is.

P_c = Pr(Object) * IOU_{pred}^{truth}

This is from YOLO paper.

If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.

This implies, there are two cases.

  1. If there is no object, P_c = 0 since Pr(Object) = 0
  2. If any object exists, P_c = IOU_{pred}^{truth}. This means Pr(Object) = 1
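
The two cases can be sketched in a few lines. This is only an illustration of the definition quoted from the paper, assuming boxes are given as (x1, y1, x2, y2) corners; the function names iou and confidence_target are hypothetical and do not come from the course code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) corner boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def confidence_target(object_present, pred_box, truth_box):
    """Pr(Object) * IOU, with Pr(Object) taken as 0 or 1."""
    if not object_present:
        return 0.0                       # case 1: no object in the cell
    return iou(pred_box, truth_box)      # case 2: Pr(Object) = 1

truth = (0.0, 0.0, 2.0, 2.0)
pred = (1.0, 0.0, 3.0, 2.0)   # covers half of the truth box
print(confidence_target(True, pred, truth))
print(confidence_target(False, pred, truth))
```

Here the intersection is 2 and the union is 6, so the target for the first call is one third.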

The definition of “confidence” may have been refined across YOLO versions, but the essentials are as above.
The key outputs from YOLO are bounding box information and conditional class probabilities, Pr(Class_i|Object), which is C_i in this exercise. This is a posterior probability. The original paper states: “These probabilities are conditioned on the grid cell containing an object.”

An interesting aspect of YOLO is its hierarchical structure.

  1. Split the image into a grid of cells. In the original paper, it was 7x7.
  2. In each cell, 2 bounding boxes are predicted. (The number of bounding boxes increased with later versions, if I remember correctly.) A bounding box can be larger than a cell, but its center needs to be inside the cell. (In this sense, I suppose the “cell” is for parallelization.)
  3. Each bounding box's information includes P_c and the bounding box location/size. It does not include conditional class probabilities.
  4. Each cell has “conditional class probabilities”, Pr(Class_i|Object). Each cell has only one set of Pr(Class_i|Object), regardless of the number of bounding boxes.
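
As a rough check on this layout, the output size can be counted directly. The grid size and boxes per cell come from the list above; the class count of 20 is an assumption here (the PASCAL VOC setting used in the v1 paper), not the 80 classes of this exercise.

```python
S = 7    # grid size, from the list above
B = 2    # boxes per cell, from the list above
C = 20   # number of classes; assumed (PASCAL VOC, as in the v1 paper)

values_per_box = 5                          # x, y, w, h, and the confidence P_c
values_per_cell = B * values_per_box + C    # one shared class vector per cell

print((S, S, values_per_cell))      # (7, 7, 30)
print(S * S * values_per_cell)      # 1470
```

Because the class vector is shared, both boxes in a cell are scored against the same Pr(Class_i|Object).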

(With this structure, there are some weak points… For example, the case where “there are multiple small objects inside one cell”… But I will not go further in this thread. :slight_smile: )

So, the key output is, again, bounding box information. Then, the next question should be how YOLO can be trained.
The secret is in the training data set. As I wrote, it includes a label and bounding box information. With this, a model can learn how to set the bounding box with a label for a target object, like a “car”.
Then, remember the constraint of this model, i.e., the center of the bounding box needs to be in a given cell. A detected bounding box may not cover the entire object (ground truth). Then, YOLO calculates IOU_{pred}^{truth}. That eventually becomes P_c.

In this sense, I do not think adding “others” makes sense. Moreover, it may be harmful for this model, in my opinion. Think about the training set. How would you define “others” with bounding box information?

By the way, YOLO does not use Softmax.

My understanding is that when ground truth data is established, p_c = 1 for the one grid cell + anchor box responsible for the object center and p_c = 0 for all others. It’s always a challenge talking about these things in part because the notation differs between the class materials and the papers. Redmon et al use Pr(object) for the object presence probability…they don’t use p_c. In the notebook markup it isn’t completely clear whether p_c is treated as Pr(object) or Pr(object) * IOU (b, object). The language is either ambiguous or, since there is no mention of IOU in these parts of the notebook, perhaps leans towards inferring it is Pr(object). I believe this interpretation is supported by the lectures and by these pieces in the exercise code…

def yolo_head():
    box_conf : tensor
        Probability estimate for whether each box contains any object.
...
    box_confidence = K.sigmoid(feats[..., 4:5])
...
    return box_confidence,...


def yolo_loss(...,rescore_confidence=False,...):
    rescore_confidence : bool, default=False
        If true then set confidence target to IOU of best predicted box with
        the closest matching ground truth box.
...
    pred_xy, pred_wh, pred_confidence, pred_class_prob = yolo_head(...) #NOTE: the return params are out of order in the version of keras_yolo.py I have from 2018

    no_objects_loss = no_object_weights * K.square(-pred_confidence)
    if rescore_confidence:
        objects_loss = (object_scale * detectors_mask * K.square(best_ious - pred_confidence))
    else:
        objects_loss = (object_scale * detectors_mask * K.square(1 - pred_confidence))  

none of which seems to me to align directly with this equation in the v2 paper

Pr(object) * IOU(b, object) = \sigma(t_o)

In our code, since box_confidence = K.sigmoid(feats[..., 4:5]) and t_o is feats[...,4:5], then Pr(object) == box_confidence == \sigma(t_o)
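
The two branches of the confidence loss excerpted above can be sketched for a single box in plain Python. This is a simplification, not the keras_yolo.py implementation: is_responsible stands in for detectors_mask, every non-responsible box is pushed toward 0, and the scale defaults are assumed values.

```python
def confidence_loss(pred_confidence, is_responsible, best_iou,
                    rescore_confidence=False,
                    object_scale=5.0, no_object_scale=1.0):
    """Squared-error confidence loss for one predicted box.

    is_responsible plays the role of detectors_mask (True/False).
    The scale defaults are assumptions for illustration.
    """
    if is_responsible:
        # Target is 1 by default, or the IOU when rescoring is enabled.
        target = best_iou if rescore_confidence else 1.0
        return object_scale * (target - pred_confidence) ** 2
    # Simplification: every non-responsible box has target 0.
    return no_object_scale * (0.0 - pred_confidence) ** 2

default_loss = confidence_loss(0.9, True, best_iou=0.6)
rescored_loss = confidence_loss(0.9, True, best_iou=0.6, rescore_confidence=True)
empty_loss = confidence_loss(0.9, False, best_iou=0.0)
print(default_loss, rescored_loss, empty_loss)
```

In both branches the target enters as a difference inside a square, so the IOU replaces the target of 1 rather than multiplying the prediction.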

In summation, my read is that across the class lecture, notebook markup, and our version of the v2 darknet code circa 2018, p_c means Pr(object). Despite showing up in parts of both the v1 and v2 papers, I can’t find any support for p_c being Pr(object) * IOU(b, object) in our class materials. Rather, it is just treated as object presence probability (or confidence) unless rescore_confidence=True in yolo_loss, in which case the interaction is still not multiplicative. Despite the ambiguity around whether the IOU of the predicted bounding box is included in p_c in this course material, I think we do all agree that the final class scores are the product of p_c and c_i. You can see that in the implementation

def yolo_filter_boxes(...):
    box_scores = box_confidence * box_class_probs
...
    box_class_scores = K.max(box_scores, axis=-1)
...
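
For readers without the Keras backend at hand, the same filtering step can be mirrored with plain lists. This is a sketch and not the notebook's code: filter_boxes is a hypothetical name and the 0.6 threshold is an assumed value.

```python
def filter_boxes(box_confidence, box_class_probs, threshold=0.6):
    """Return (box index, class index, score) for boxes passing the threshold.

    box_confidence: list of p_c values, one per box.
    box_class_probs: list of per-box class probability lists (c_i).
    """
    kept = []
    for i, (p_c, probs) in enumerate(zip(box_confidence, box_class_probs)):
        scores = [p_c * c for c in probs]                # box_scores
        best_class = max(range(len(scores)), key=scores.__getitem__)
        best_score = scores[best_class]                  # box_class_scores
        if best_score >= threshold:
            kept.append((i, best_class, best_score))
    return kept

confidences = [0.9, 0.01, 0.5]
probs = [[0.8, 0.1, 0.1], [0.9, 0.05, 0.05], [0.4, 0.3, 0.3]]
kept = filter_boxes(confidences, probs)
print(kept)
```

Only the first box survives. The second has the most confident class vector of the three but is dropped because its p_c is tiny.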

btw it looks like there is a cut and paste artifact /typo in my quote above from the original paper. The word pred was incorrectly left over after I deleted the copy/paste of equation (1) from the paper. Sometimes the Discourse UI on the iPad gets wonky when using emphasis fonts and LaTeX, but in any case apparently I didn’t proof well. My bad.

Welcome suggestions for clarification/correction

At least the version of darknet v2 that was initially used in this class programming exercise definitely uses softmax. See the section of yolo_head() in the provided keras_yolo.py

    box_class_probs = K.softmax(feats[..., 5:])

Not sure about other versions.

I think there are multiple versions which made us confused. :sweat_smile:

YOLOv3: An Incremental Improvement

Each box predicts the classes the bounding box may contain using multilabel classification. We do not use a softmax as we have found it is unnecessary for good performance, instead we simply use independent logistic classifiers. During training we use binary cross-entropy loss for the class predictions.
This formulation helps when we move to more complex domains like the Open Images Dataset. In this dataset there are many overlapping labels (i.e. Woman and Person). Using a softmax imposes the assumption that each box has exactly one class which is often not the case. A multilabel approach better models the data.
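
The difference between the two activations is easy to see numerically. The sketch below uses made-up logits; only the behaviour matters: softmax outputs always sum to 1, while independent logistic outputs can all be near zero or have several high values at once.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up logits for three classes.
logits_nothing = [-6.0, -6.0, -6.0]   # no evidence for any class
logits_overlap = [4.0, 4.0, -6.0]     # two labels apply, e.g. Woman and Person

for logits in (logits_nothing, logits_overlap):
    sm = softmax(logits)
    sg = [sigmoid(z) for z in logits]
    print("softmax:", [round(p, 3) for p in sm], "sum =", round(sum(sm), 3))
    print("sigmoid:", [round(p, 3) for p in sg])
```

With softmax, the "no evidence" case still spreads a total of 1 across the classes, which is why a separate objectness term is needed to say "nothing here".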

From V3 on, it seems the following design (restriction) was revisited. Now each bounding box has an objectness score, which was, I think, P_c in V2.

  • Each bounding box's information includes Pc and the bounding box location/size. It does not include conditional class probabilities.
  • Each cell has only one set of Pr(Class_i|Object), regardless of the number of bounding boxes.

We are probably better off focusing on our version, which may be slightly different from the others…

I quickly skimmed the original C code, the PyTorch code, and our code in Python (Keras). The last one is very, very scary… Actually, there are lots of derivatives of the original C code. But ours has so many TODOs in the comments and many excuses for not following the original concept… :fearful:

Our P_c is “objectness” by definition. This is one of the outputs of the convolutional network, after applying a sigmoid.

The output features from the network are:

t_x, t_y, t_w, t_h (features[0:4]) : bounding (anchor) box shape
t_0 (features[4:5]) : objectness
t_1,t_2,...t_c (features[5:]) : class probability

All are from the network which was trained. “bounding box shape” and “class probability” are obvious for our class members.
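
The slicing can be sketched for a single anchor box as follows. The activations match the yolo_head() lines quoted earlier in the thread (sigmoid on index 4, softmax on the rest); the function name decode_box, the three-class vector, and the numeric values are made up for illustration.

```python
import math

def decode_box(features):
    """Split one anchor's raw outputs into box terms, objectness, class probs."""
    t_box = features[0:4]                        # t_x, t_y, t_w, t_h
    t_o = features[4]                            # raw objectness
    class_logits = features[5:]                  # one logit per class
    objectness = 1.0 / (1.0 + math.exp(-t_o))    # sigmoid, so strictly in (0, 1)
    m = max(class_logits)
    exps = [math.exp(z - m) for z in class_logits]
    total = sum(exps)
    class_probs = [e / total for e in exps]      # softmax, sums to 1
    return t_box, objectness, class_probs

# Made-up raw outputs: 4 box terms, 1 objectness term, 3 class logits.
features = [0.2, -0.1, 0.5, 0.3, 2.0, 1.0, 0.0, -1.0]
t_box, objectness, class_probs = decode_box(features)
print(t_box, round(objectness, 3), [round(p, 3) for p in class_probs])
```

With 80 classes the vector would have 85 entries per anchor instead of 8.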

Actually, the loss function penalizes “bounding box coordinates loss”, “classification loss” and “confidence loss”. The last one is the relevant one here, and it is quite important for getting higher performance.

Objectness is basically an IOU showing how much this bounding (anchor) box covers a target object, like a “car”. If an anchor box covers a target object well, the IOU is very close to 1. Note that this IOU is for a single bounding box, used internally in the YOLO head, and is not the one we learned about in the Jupyter notebook, which is prepared for non-max suppression of bounding boxes (actually not used in our exercise, though…). The important thing is that the network needs to learn the “0” case (object does not exist), not just the “1” case (object exists). So, the loss function sets a threshold. If the objectness is lower than that, then the objectness label for that anchor box is considered to be “0”. So, the “confidence loss” is calculated to focus on the bounding box which has the largest IOU (label “1”) and the bounding boxes to be cut off (label “0”). With this, the network can learn how to set the objectness.
As the output of the sigmoid function is between 0 and 1, the value of “objectness” is a value between 0 and 1, not exactly 0 or 1.

In sum, our P_c is the “objectness”, showing how much a box covers a target object. Of course, larger is better.