Car detection with YOLO programming assignment Course 4 Week 3

jjbarnes · May 18, 2026, 10:52am

Why does the file

./model_data/yolo_anchors.txt

contain 10 numbers when there are only 5 anchor box aspect ratios?

rmwkwok · May 19, 2026, 12:14am

Hello @jjbarnes,

They are the scaling constants for the width and height of the boxes, so 5 + 5 = 10 numbers.

Below it shows how we can find the answer:
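One way to check this yourself is to parse the file contents and group the numbers in pairs. This is only a minimal sketch: the string below stands in for the file, and its values are an assumption for illustration, so please compare them with your own copy of ./model_data/yolo_anchors.txt.

```python
import numpy as np

# Illustrative file contents: ten comma-separated floats on one line.
# These values are an assumption for the example, not read from the real file.
anchors_text = ("0.57273, 0.677385, 1.87446, 2.06253, 3.33843, "
                "5.47434, 7.88282, 3.52778, 9.77052, 9.16828")

# Parse the flat list, then group consecutive values into (width, height) pairs.
flat = np.array([float(x) for x in anchors_text.split(",")])
anchors = flat.reshape(-1, 2)

print(flat.shape)     # (10,)
print(anchors.shape)  # (5, 2)
for w, h in anchors:
    print(f"width={w:.3f}  height={h:.3f}  w/h ratio={w / h:.3f}")
```

Each row is one anchor box, so 5 boxes give 10 numbers. A single aspect ratio per box would lose the size of the box, which is why both values are stored.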

Enjoy the assignment!

Cheers,
Raymond

jjbarnes · May 19, 2026, 10:34am

But the assignment notebook talks about width/height ratios, which would be 5 values, not 10.

Even if they are widths and heights, what units are they in?

jjbarnes · May 19, 2026, 1:34pm

Have a look at this from the assignment notebook…

jjbarnes · May 20, 2026, 11:10am

Any idea yet???

rmwkwok · May 22, 2026, 12:11am

Hello @jjbarnes,

(I deleted my last message; please refer to this one for my response.)

Sorry for getting back late; I don’t spend time here every day.

First, I believe the sentence regarding the ratio in your screenshot is intended to explain how the boxes are chosen, rather than how they are represented. However, your question is highly reasonable because those decimal numbers look very much like ratios rather than pixel values, which would typically be integers larger than 1.

In this case, they are represented as the scaling constants for the width and height (I have corrected my previous response to make it more accurate).

Here is how it works:

The notebook implements the four equations from the second paper mentioned in the notebook’s first paragraph. The last two equations show how the width and height are calculated from the network’s outputs and the scaling constants, with all parameters on both sides of the equations unit-less.
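For reference, here are the four equations as I recall them, assuming the paper in question is the YOLO9000 (YOLOv2) paper. The t terms are the raw network outputs, (c_x, c_y) is the offset of the grid cell, and (p_w, p_h) are the anchor box width and height, which are the numbers stored in the anchors file.

```latex
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
```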

They become pixel values after being further scaled by the image’s width and height, which is done in the function yolo_eval.

Cheers,

Raymond

jjbarnes · May 22, 2026, 11:23am

Thanks Raymond, that explains it. The notebook description is misleading.

ai_curious · May 22, 2026, 4:05pm

In the equation b_w = p_w * e^{t_w} I also think of one of the two RHS terms as a scaling factor, but in my mind it isn’t the p_w term. That is the static size (width, since we’re using w in the example) of one of the anchor boxes learned during the exploratory data analysis phase. The exponential term, which is a dimensionless number, is what I think of as a scaling factor, and that is what the network is learning to predict. If t_w is 0, e^{t_w} equals 1 and the predicted box width, b_w, is equal to the anchor box width, p_w. If t_w > 0, the exponential term is greater than 1 and the predicted bounding box width is greater than the anchor box width. Conversely, if t_w < 0, the exponential term is less than 1 and the predicted bounding box width is less than the anchor box width.

In this mental model a predicted bounding box shape is always a static anchor box shape multiplied by the scaling factor learned by the network.
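A tiny sketch of that mental model, in case it helps. The anchor width of 3.0 and the function name predicted_width are made up for illustration; they are not taken from the exercise code.

```python
import math

def predicted_width(p_w: float, t_w: float) -> float:
    """Anchor width p_w multiplied by the dimensionless factor e^{t_w}."""
    return p_w * math.exp(t_w)

p_w = 3.0  # hypothetical anchor width, in grid-cell units

for t_w in (-1.0, 0.0, 1.0):
    scale = math.exp(t_w)
    print(f"t_w={t_w:+.1f}  scale={scale:.3f}  b_w={predicted_width(p_w, t_w):.3f}")
```

The sign of t_w decides whether the predicted box is narrower or wider than the anchor, and t_w = 0 reproduces the anchor exactly. Because the exponential is always positive, the predicted width can never go negative.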

Mathematically the same result, just a different (and easier for me) way of conceptualizing. Cheers a_c

edit: in terms of units, I am not sure why the anchors.txt values are stored as floating point, but agree that they are multiplied by the grid cell size deep in the code in order to convert to pixels. A tuple of 1.0, 1.0 in anchors.txt means that anchor box is the same size as the grid cells; 0.5, 0.5 is half the grid cell size, and so on.
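To put numbers on that, here is a rough sketch of the grid-cell-to-pixel arithmetic. The 608 x 608 input size and 19 x 19 grid are my assumptions about the assignment setup, and anchor_to_pixels is a hypothetical helper, not a function from the notebook.

```python
def anchor_to_pixels(anchor_wh, image_size=(608, 608), grid_size=(19, 19)):
    """Convert an anchor (width, height) in grid-cell units to pixels.

    image_size and grid_size are (width, height); the defaults are assumed values.
    """
    cell_w = image_size[0] / grid_size[0]  # pixels per grid cell, horizontally
    cell_h = image_size[1] / grid_size[1]  # pixels per grid cell, vertically
    return anchor_wh[0] * cell_w, anchor_wh[1] * cell_h

print(anchor_to_pixels((1.0, 1.0)))  # (32.0, 32.0): exactly one grid cell
print(anchor_to_pixels((0.5, 0.5)))  # (16.0, 16.0): half a grid cell
```

Storing the anchors in grid-cell units keeps them independent of the input resolution, which may be one reason they are floats rather than integer pixel counts.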

Here is the code where that unit conversion happens…


box_wh = box_wh * anchors_tensor / conv_dims

where the values read out of anchors.txt comprise the anchors_tensor above. The nomenclature used here, box_wh, is confusing (to me at least) but there is another thread that contains more detailed analysis of the concepts here →

EDIT

In that linked thread I wrote this line from the exercise code…

box_wh = K.exp(feats[..., 2:4])

from which you can see that the box_wh variable referenced in this post above corresponds to the exponential term of the predicted bounding box equation; not b_w and not p_w. That is, feats[..., 2:4] is exactly t_w (well, and t_h, since it is extracting the network outputs corresponding to both width and height in one Python expression).

So this line

box_wh = box_wh * anchors_tensor / conv_dims

takes the exponential term in box_wh, multiplies it by the anchor box shapes in grid cell scale in anchors_tensor, then divides by conv_dims (the grid dimensions, i.e. the ratio of the image size to the grid cell size, if I recall correctly) to arrive at a predicted bounding box width and height relative to the image size, which it overwrites into the same variable box_wh (ugh). The final conversion to pixels is the later scaling by the image’s width and height that Raymond mentioned. Clear as mud, eh?
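Here is a NumPy re-creation of those two lines, as a sketch rather than the exercise code itself. I am assuming conv_dims holds the grid dimensions (19 x 19), a 608 x 608 input, and made-up anchor values; t_wh is random data standing in for feats[..., 2:4]. Under those assumptions the division leaves widths and heights as fractions of the image, and one more multiplication by the image size gives pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

grid_h, grid_w, num_anchors = 19, 19, 5   # assumed grid size and anchor count
image_wh = np.array([608.0, 608.0])       # assumed input size in pixels

# Stand-in for feats[..., 2:4]: raw outputs (t_w, t_h) per cell and per anchor.
t_wh = rng.normal(size=(1, grid_h, grid_w, num_anchors, 2))

# Hypothetical anchors in grid-cell units, shaped to broadcast over batch and grid.
anchors = np.array([[0.5, 0.5], [1.0, 1.0], [2.0, 3.0], [4.0, 2.0], [8.0, 8.0]])
anchors_tensor = anchors.reshape(1, 1, 1, num_anchors, 2)
conv_dims = np.array([grid_w, grid_h], dtype=float).reshape(1, 1, 1, 1, 2)

box_wh = np.exp(t_wh)                         # the exponential term e^{t}
box_wh = box_wh * anchors_tensor / conv_dims  # now a fraction of the image size

box_wh_pixels = box_wh * image_wh             # later scaling step, to pixels

# Equivalent view: grid-unit size multiplied by the cell size in pixels.
cell_wh = image_wh / np.array([grid_w, grid_h])
assert np.allclose(box_wh_pixels, np.exp(t_wh) * anchors_tensor * cell_wh)
print(box_wh_pixels.shape)  # (1, 19, 19, 5, 2)
```

The final check shows that dividing by the grid dimensions and then multiplying by the image size is the same as multiplying by the grid cell size in pixels, so both descriptions in this thread land on the same numbers.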

rmwkwok · May 25, 2026, 11:49pm

Thank you @ai_curious for your feedback and explanations, which make a lot of sense. Your response also reminded me of the paragraph right above the four equations I quoted:

and I think it resonates more with yours than my choice of terms. Clearly I am not experienced enough in YOLO to deliver the accurate semantics.

Cheers,
Raymond
