Dimension for anchor boxes

In the programming exercises, we have:

  • The dimension for anchor boxes is the second to last dimension in the encoding: (m, n_H, n_W, anchors, classes).
    Actually, I don't quite get it. I thought that the anchor boxes were defined by n_H and n_W (their height and width), and that the dimensions of each anchor box should be (p_c, b_x, b_y, b_h, b_w, classes).
    I hope you could help me clarify this!
    Thanks a lot!

Hi jacknguyen101,

The dimension of the anchor boxes mentioned here refers to the dimension of the training set. There are m images in the training set, with height n_H, width n_W, and they belong to a particular anchor (anchors), and particular class (classes).

The dimension you are referring to with (pc, bx, by, bh, bw, classes) is the dimension of the output of the model.

There are 3 values one might consider dimensions related to anchor boxes. The first is the number of anchor boxes being used. In the original YOLO research paper, each grid cell predicted 2 boxes; in the car detection programming exercise the number is 5. The other two dimensions are the height and width of the anchor boxes themselves. There is a utility file in the exercise called yolo_anchors.txt that contains 10 values: a height and a width for each of the 5 anchor boxes. These are completely independent of the shape of the input image and of the number of training images being used. In the original post above, anchors is 5.
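To make the "10 values, 5 boxes" point concrete, here is a minimal sketch of turning a flat list of 10 numbers into 5 anchor pairs. The string below is a made-up stand-in, not the real contents of yolo_anchors.txt, and `parse_anchors` is a hypothetical helper name rather than a function from the exercise.

```python
import numpy as np

# Hypothetical stand-in for the contents of yolo_anchors.txt:
# 10 comma-separated numbers. These values are invented for illustration.
anchors_text = "1.0, 1.5, 2.0, 2.5, 3.0, 5.0, 7.0, 3.5, 9.0, 8.0"


def parse_anchors(text):
    """Turn a flat comma-separated string into an array with one row per anchor box."""
    values = [float(v) for v in text.split(",")]
    # Two numbers per anchor box, so reshape the flat list into (num_anchors, 2).
    return np.array(values).reshape(-1, 2)


anchors = parse_anchors(anchors_text)
print(anchors.shape)  # (5, 2): 5 anchor boxes, 2 size values each
print(len(anchors))   # 5: this is the "anchors" entry in the encoding shape
```

Nothing here depends on the image size or on the number of training images; only the count of rows (5) shows up in the encoding shape.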

Hi ai_curious,

The statement in the assignment refers to the total collection of scores based on the anchor boxes:

"The dimension for anchor boxes is the second to last dimension in the encoding: (m, n_H, n_W, anchors, classes).
The YOLO architecture is: IMAGE (m, 608, 608, 3) → DEEP CNN → ENCODING (m, 19, 19, 5, 85)."

For clarity, it might have been better if the text had read something like 'The dimension of the encoding tensor based on the anchor boxes is (m, n_H, n_W, anchors, classes)'.

If the lower case encoding (m, n_H, n_W, anchors, classes) and the upper case ENCODING (m, 19, 19, 5, 85) are supposed to be referring to the same thing, then like the original poster I find them incongruent. The word classes in this exercise means 80, does it not? The last dimension should be (1 + 4 + classes).

Also, it's a little imprecise to state that (pc, bx, by, bh, bw, classes) is the dimension of the output of the model. Shouldn't that be S*S*B*(1+4+classes)?
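As a rough illustration of the arithmetic, the sketch below builds a dummy tensor with the ENCODING shape quoted in the thread and splits its last axis into the 1 + 4 + classes parts. The batch size of 2, the variable names, and the ordering of the last axis (pc first, then the four box values, then the class scores) are assumptions for the example, not taken from the assignment code.

```python
import numpy as np

m, n_H, n_W = 2, 19, 19         # m = 2 is an arbitrary batch size for this sketch
num_anchors = 5                 # "anchors" in the assignment text
num_classes = 80                # "classes" in the assignment text
depth = 1 + 4 + num_classes     # pc + (bx, by, bh, bw) + class scores = 85

# Dummy tensor with the same shape as ENCODING (m, 19, 19, 5, 85).
encoding = np.zeros((m, n_H, n_W, num_anchors, depth))

# Assumed layout of the last axis: [pc, bx, by, bh, bw, c_1, ..., c_80].
pc = encoding[..., 0:1]         # shape (m, 19, 19, 5, 1)
box = encoding[..., 1:5]        # shape (m, 19, 19, 5, 4)
class_scores = encoding[..., 5:]  # shape (m, 19, 19, 5, 80)

# S*S*B*(1+4+classes) values per image, with S = 19 and B = 5.
values_per_image = n_H * n_W * num_anchors * depth
print(encoding.shape)           # (2, 19, 19, 5, 85)
print(values_per_image)         # 153425
```

Seen this way, the second to last axis (size 5) indexes the anchor boxes, while the last axis has size 85 rather than 80, which is the mismatch pointed out above.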

Yes, it all is imprecise. It seems to me that the author of the assignment was trying to say a number of things at the same time, squeezing everything into an imprecise statement. Fortunately, this has not led to much confusion so far, as there has only been one question about this since the refresh. But I'll report it at the backend.