Why does the file
./model_data/yolo_anchors.txt
contain 10 numbers when there are only 5 anchor box aspect ratios?
Hello @jjbarnes,
They are the scaling constants for the width and height of the boxes, so 5 + 5 = 10 numbers.
Here is how we can find the answer:
Enjoy the assignment!
Cheers,
Raymond
But the assignment notebook talks about width/height ratios which is 5 values not 10.
Even if they are widths and heights, what units are they in?
Any idea yet???
Hello @jjbarnes,
(I deleted my last message; please refer to this one for my response.)
Sorry for getting back to you late; I don’t spend time here every day.
First, I believe the sentence regarding the ratio in your screenshot is intended to explain how the boxes are chosen, rather than how they are represented. However, your question is highly reasonable, because those decimal numbers look very much like ratios rather than pixel values, which should typically be integers larger than 1.
In this case, they are represented as the scaling constants for the width and height (I have corrected my previous response to make it more accurate).
Here is how it works:
The notebook implements the four equations from the second paper mentioned in the notebook’s first paragraph. The last two equations show how the width and height are calculated from the network’s outputs and the scaling constants, with all parameters on both sides of the equations unitless.
They become pixel values after being further scaled by the image’s width and height, which is done in the function yolo_eval.
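For reference, and assuming the second paper is YOLO9000 (Redmon and Farhadi), the four equations are usually written as below. Here c_x, c_y are the grid cell offsets, p_w, p_h are the anchor (prior) width and height, and the t terms are the raw network outputs.

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
```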
Cheers,
Raymond
Thanks Raymond, that explains it. The notebook description is misleading.
In the equation b_w = p_w * e^{t_w} I also think of one of the two RHS terms as a scaling factor, but in my mind it isn’t the p_w term. That is the static size (width, since we’re using w in the example) of one of the anchor boxes learned during the exploratory data analysis phase. The exponential term, which is a dimensionless number, is what I think of as a scaling factor, and that is what the network is learning to predict. If t_w is 0, e^{t_w} equals 1 and the predicted box width, b_w, is equal to the anchor box width, p_w. If t_w > 0, then the exponential term is greater than 1 and the predicted bounding box width is greater than the anchor box width. Conversely, if t_w < 0, the predicted bounding box width is less than the anchor box width.
In this mental model a predicted bounding box shape is always a static anchor box shape multiplied by the scaling factor learned by the network.
Mathematically the same result, just a different (and easier for me) way of conceptualizing. Cheers a_c
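A minimal sketch of that mental model, in case it helps. The anchor width and the t_w values are made up for illustration, and `predicted_width` is a hypothetical helper, not something from the notebook. The point is only that the scaling factor e^{t_w} crosses 1 exactly at t_w = 0.

```python
import math


def predicted_width(p_w: float, t_w: float) -> float:
    """b_w = p_w * e^{t_w}: a static anchor width times a predicted scaling factor."""
    return p_w * math.exp(t_w)


p_w = 2.0  # hypothetical anchor width, in grid-cell units

same = predicted_width(p_w, 0.0)       # e^0 == 1, so the box equals the anchor
wider = predicted_width(p_w, 0.5)      # t_w > 0 gives a factor above 1
narrower = predicted_width(p_w, -0.5)  # t_w < 0 gives a factor below 1

print(same, wider, narrower)
```

The same reasoning applies to the height with p_h and t_h.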
edit: in terms of units, I am not sure why the anchors.txt values are stored as floating point, but agree that they are multiplied by the grid cell size deep in the code in order to convert to pixels. A tuple of 1.0, 1.0 in anchors.txt means that anchor box is the same size as the grid cells. 0.5, 0.5 is half the grid cell size, etc.
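To make the tuple reading concrete, here is a small sketch. The ten numbers are invented (they are not the real file contents), and the 608x608 input with a 19x19 grid is an assumption about the network geometry, so treat the pixel values as illustrative only.

```python
import numpy as np

# Hypothetical file contents: a flat, comma-separated list of 10 numbers.
anchors_text = "1.0, 1.0, 0.5, 0.5, 2.0, 3.0, 4.0, 2.5, 6.0, 5.0"

# 10 numbers -> 5 rows of (width, height), one row per anchor box.
flat = np.array([float(x) for x in anchors_text.split(",")])
anchors = flat.reshape(-1, 2)

# Assumed geometry: a 608x608 input divided into a 19x19 grid gives 32 px cells.
image_size = 608
grid_size = 19
cell_px = image_size / grid_size

# Anchors in grid-cell units become pixels when multiplied by the cell size.
anchors_px = anchors * cell_px
print(anchors_px)
```

Under these assumptions the (1.0, 1.0) anchor is one full cell, 32 by 32 pixels, and the (0.5, 0.5) anchor is 16 by 16 pixels.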
Here is the code where that unit conversion happens…
box_wh = box_wh * anchors_tensor / conv_dims
where the values read out of anchors.txt comprise the anchors_tensor above. The nomenclature used here, box_wh, is confusing (to me at least) but there is another thread that contains more detailed analysis of the concepts here →
EDIT
In that linked thread I wrote this line from the exercise code…
box_wh = K.exp(feats[..., 2:4])
from which you can see that the box_wh variable referenced in this post above corresponds to the exponential term of the predicted bounding box equation; not b_w and not p_w. That is, feats[..., 2:4] is exactly t_w (well, and t_h, since it is extracting the network outputs corresponding to both width and height in one Python expression).
So this line
box_wh = box_wh * anchors_tensor / conv_dims
takes the exponential term in box_wh, multiplies it by the anchor box shapes in grid cell scale in anchors_tensor, then divides by the convolution sampling scale (the inverse of the ratio of the image size to the grid cell size, if I recall correctly) to arrive at a predicted bounding box width and height in pixels that it overwrites into the same variable box_wh (ugh). Clear as mud, eh?
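Here is a NumPy sketch of that line with the intermediate steps given separate names. It rests on assumptions: that conv_dims holds the grid dimensions (19 x 19 here), that the anchors are in grid-cell units, and that the input is 608x608. `box_wh_from_outputs` is a hypothetical helper, not the notebook's function. If conv_dims really is the grid size, the result of the line is a fraction of the image, and it becomes pixels only after the later scaling by the image size mentioned earlier in the thread for yolo_eval.

```python
import numpy as np


def box_wh_from_outputs(t_wh, anchors, conv_dims):
    """Mirror of: box_wh = K.exp(t) * anchors_tensor / conv_dims."""
    exp_term = np.exp(t_wh)                # dimensionless scaling factor
    return exp_term * anchors / conv_dims  # fraction of the image, under the assumptions


anchors = np.array([[1.0, 1.0], [0.5, 0.5]])  # hypothetical, in grid-cell units
conv_dims = np.array([19.0, 19.0])            # assumed grid dimensions
t_wh = np.zeros((2, 2))                       # t = 0, so each box equals its anchor

box_wh = box_wh_from_outputs(t_wh, anchors, conv_dims)

# A later step scales by the image size to get pixels (assumed 608x608 input).
image_wh = np.array([608.0, 608.0])
box_wh_px = box_wh * image_wh
print(box_wh, box_wh_px)
```

Giving the exponential term, the normalized size, and the pixel size their own names avoids the overwriting of box_wh that makes the original line hard to read.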
Thank you @ai_curious for your feedback and explanations, which make a lot of sense. Your response also reminded me of the paragraph right above the four equations I quoted:
and I think it resonates more with your choice of terms than with mine. Clearly I am not experienced enough in YOLO to deliver the accurate semantics.
Cheers,
Raymond