In Figure 4 of the programming notebook, the probability score (pc × ci) is calculated for every class i of a particular anchor box in the output for a grid cell of the input image. Why don’t we just take the max of all the class probabilities first and then multiply it by pc to get the probability score of the detected class within that anchor box? It would save a lot of redundant multiplications (79 in this case).

Hi @shailesh_dagar,

I am not sure I got your point, so feel free to add your comments.

The reason we don’t do what you suggest is that we want to preserve the individual class probabilities. In the YOLO model, each bounding box is associated with a class label and a corresponding class probability. The calculation pc * c is done for each class, and the class with the highest score is assigned to the bounding box.
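For reference, here is a minimal sketch of that per-class calculation for a single anchor box. The values are made up for illustration, the 80-class count is an assumption based on the "79" in the question, and the names `pc` and `c` simply follow the notation used in this thread rather than the notebook's actual variables.

```python
import numpy as np

num_classes = 80  # assumption: 80 classes, matching the "79" in the question

rng = np.random.default_rng(0)
pc = 0.9                                   # made-up object presence confidence
c = rng.dirichlet(np.ones(num_classes))    # made-up class probabilities, sum to 1

# One multiplication per class: the score of class i is pc * c[i].
scores = pc * c

# The class with the highest score is assigned to the box.
best_class = int(np.argmax(scores))
best_score = float(scores[best_class])

print(best_class, best_score)
```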

If we were to take the maximum class probability first and then multiply it by pc, we would be assuming that the bounding box can only contain the object of the class with the highest probability. This would not be correct, as the bounding box could contain an object of any class.

Moreover, taking the maximum class probability first would not necessarily save computational resources. The class probabilities are computed by the model during the forward pass, and this computation is necessary regardless of whether we take the maximum first or not. The number of multiplications would remain the same, as we still need to compute the class score for each class.

Keep learning!

I’m confused about a couple of these statements.

Isn’t that exactly what we want to do, assign one and only one class to each predicted bounding box? We could just assign the max class probability, but here we weight by the object presence confidence, which is class independent. Seems like you can take the maximum value after the multiplication or do the multiplication after extracting the highest confidence class, but in either case, isn’t the numeric result the same? As far as I can tell, there is no loss of information by taking the max first, as the vector of class predictions output by the neural net is still there, unchanged, to do anything you want with. But you would end up with a pure scalar multiplication rather than a broadcast.
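Here is a rough NumPy sketch to check that claim, assuming pc is strictly positive (as a probability would be). The 19 x 19 grid, 5 anchors, and 80 classes are assumed shapes, the data is random, and both function names are hypothetical, not taken from the notebook.

```python
import numpy as np

def multiply_then_max(pc, class_probs):
    # Broadcast pc over the class axis, then reduce.
    scores = pc[..., np.newaxis] * class_probs
    return scores.argmax(axis=-1), scores.max(axis=-1)

def max_then_multiply(pc, class_probs):
    # Reduce over the class axis first, then one multiplication per box.
    best_class = class_probs.argmax(axis=-1)
    best_prob = class_probs.max(axis=-1)
    return best_class, pc * best_prob

# Assumed shapes: 19 x 19 grid, 5 anchor boxes, 80 classes (illustrative only).
grid, anchors, num_classes = 19, 5, 80
rng = np.random.default_rng(0)
pc = rng.uniform(0.05, 1.0, size=(grid, grid, anchors))  # strictly positive
class_probs = rng.dirichlet(np.ones(num_classes), size=(grid, grid, anchors))

classes_a, scores_a = multiply_then_max(pc, class_probs)
classes_b, scores_b = max_then_multiply(pc, class_probs)

print(np.array_equal(classes_a, classes_b), np.allclose(scores_a, scores_b))

# Multiplication counts for the two orderings.
mults_broadcast = grid * grid * anchors * num_classes
mults_scalar = grid * grid * anchors
print(mults_broadcast, mults_scalar)
```

The two orderings agree because multiplying by a positive constant preserves the ordering of the class probabilities. The original `class_probs` array is untouched in both cases, so nothing is lost by reducing first.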

If by *during* you mean *once at the completion of* then I agree that the unweighted class scores are outputs of the neural network and all the work to produce that vector of outputs is done in the last layer of every forward pass. But this weighting of the raw class score prediction by the object presence prediction isn’t done by the neural net. Rather, it’s a postprocessing step.

I’m team @shailesh_dagar on this one, at least conceptually. I’m not positive that the weighting multiplication described in this part of the notebook is actually a part of the YOLO v2 implementation, so maybe this is a purely theoretical discussion.