Week 3 A1 Part 1 Tensor dimensions clarification

Good day,

I need a refresher on tensor dimensions and want to make sure I am understanding them correctly.
My understanding so far:

boxes – tensor of shape (19, 19, 5, 4) - so the first 2 dimensions are the dimensions of the image after encoding. The 3rd dimension represents the 5 anchor boxes for each of these 19^2 grid cells and the 4 represents bx, by, bh, bw for each of the 5 anchor boxes, in each of these 19^2 grid cells.

box_confidence – tensor of shape (19, 19, 5, 1) - This encodes for each anchor box in each grid cell the confidence that there is some object detected.

box_class_probs – tensor of shape (19, 19, 5, 80) - This encodes, for all 80 classes, in each of the 5 anchor boxes for each of the 19^2 grid cells, the probability that the class is present in the anchor box. So for a particular grid cell, and for each of the 5 anchor boxes in that grid cell, there is an 80-dimensional vector with the class probabilities for the 80 classes.

So now we are told box_scores is of dimension (19, 19, 5, 80). Here is my first breakdown in understanding: why do we need to calculate box_scores? Surely box_class_probs already encodes all the necessary information?

Now I get the following shapes:
Box scores shapes (19, 19, 5, 80)
Box classes [19 19 5]
Box class scores[19 19 5]

Now I understand how we get the shape for box_scores, but I don’t quite understand what information is encoded by the next two. I think box_classes holds, for each of the 5 anchor boxes in each of the 19^2 grid cells, the class with the highest probability, and box_class_scores holds the corresponding score for each of those classes.

Is my understanding mostly correct?

I also need a hint for how I make the filtering mask have the same dimensions as box_class_scores.

Just to be clear, b_x, b_y, b_w, and b_h are the predicted center location and predicted shape of an object bounding box…not the location and shape of an anchor box. So, yes, there is a vector of length 4 associated with each grid cell + anchor box matrix location (the YOLO papers sometimes refer to these as detectors) but the b_{…} values are bounding box predictions, not anything about the shape of the anchor boxes themselves.

Maybe review the notebook markup discussion of p_c * c_i and think about what that would look like in code.
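For anyone who wants to see the shapes concretely, here is a minimal NumPy sketch of that product. The random arrays are stand-ins for real network outputs, and the assignment itself may use a different framework, so treat this as an illustration of the broadcasting only. The reason for the product is that the class probabilities are conditional on an object being present, so multiplying by p_c gives a per-class score.

```python
import numpy as np

# Random stand-ins for real network outputs (assumption: values in [0, 1)).
rng = np.random.default_rng(0)
box_confidence = rng.random((19, 19, 5, 1))    # p_c for each cell and anchor
box_class_probs = rng.random((19, 19, 5, 80))  # c_i for each cell and anchor

# The trailing dimension of size 1 broadcasts across the 80 classes,
# so every class probability is scaled by that box's confidence.
box_scores = box_confidence * box_class_probs

print(box_scores.shape)  # (19, 19, 5, 80)
```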

Finally, if you had a 19x19x5x80 and you wanted to extract each of the maximum values from the 80 dimension, what shape would the result be? This implementation makes the choice to manage class prediction and score in separate, parallel data structures. The shape of these two reflects that choice. HTH
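As a rough sketch of the shape question above, again in NumPy with random stand-in data: reducing over the last axis removes the 80 dimension, and a comparison on the result gives a boolean mask with the same shape. The threshold value and the variable names filtering_mask and kept_boxes are assumptions for illustration, not necessarily what the notebook expects.

```python
import numpy as np

# Random stand-ins for real scores and boxes (assumption).
rng = np.random.default_rng(0)
box_scores = rng.random((19, 19, 5, 80))
boxes = rng.random((19, 19, 5, 4))

# Reduce over the class axis: one winning class index and one score per box.
box_classes = np.argmax(box_scores, axis=-1)     # shape (19, 19, 5)
box_class_scores = np.max(box_scores, axis=-1)   # shape (19, 19, 5)

# An elementwise comparison keeps the shape, so the mask lines up
# one-to-one with box_class_scores.
threshold = 0.6  # hypothetical value
filtering_mask = box_class_scores >= threshold   # shape (19, 19, 5), dtype bool

# Boolean indexing flattens the three masked dimensions into one.
kept_boxes = boxes[filtering_mask]               # shape (N, 4)

print(box_classes.shape, box_class_scores.shape, filtering_mask.shape)
print(kept_boxes.shape)
```

The class index and the score live in two separate arrays of identical shape, which is the "parallel data structures" choice mentioned above.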

Thank you, that does help a lot, and thanks for correcting my understanding of bounding boxes. I was actually interpreting bx, by, bw and bh wrongly (as the location and shape of an anchor box instead of predicted bounding box values).

Anchor boxes in YOLO are a complex and nuanced concept that trips up a lot of people. I didn’t have my lightbulb moment until I tried training a YOLO model myself. If you look carefully at the equations in the paper(s) or the code, you see that the bounding box shape predictions b_h and b_w are expressed as b_h = p_h * e^{t_h} and b_w = p_w * e^{t_w}, where p stands for the anchor box shape, or prior, and t is the actual numeric value output by the network. Thus each predicted bounding box shape is proportional to its associated anchor box, or prior, shape. If the network output t == 0 then e^t == 1 and b == p. Notice that this explains another YOLO-related question many people ask, which is how a predicted bounding box can be larger than its associated anchor box (or grid cell). If the network output t > 0 then e^t > 1 and \frac{b}{p} > 1.

They are called priors because their shape is determined during exploratory data analysis on the training set, prior to actual network training. They become in effect the baseline estimates for shape predictions.
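To make the relation b = p * e^t concrete, here is a small sketch in plain Python. The function name predicted_size and the prior width of 2.0 are made up for illustration and do not come from any YOLO codebase.

```python
import math

def predicted_size(prior: float, t: float) -> float:
    """Scale a prior (anchor) dimension by e^t, as in b = p * e^t."""
    return prior * math.exp(t)

p_w = 2.0  # hypothetical anchor width, in arbitrary units

# t < 0 shrinks the box, t == 0 reproduces the prior, t > 0 enlarges it.
for t_w in (-1.0, 0.0, 1.0):
    b_w = predicted_size(p_w, t_w)
    print(f"t_w={t_w:+.1f}  b_w={b_w:.3f}  b_w/p_w={b_w / p_w:.3f}")
```

Because e^t is always positive, the predicted size can never be zero or negative, whatever raw value the network emits.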