The dimension for anchor boxes is the second to last dimension in the encoding: (m, n_H, n_W, anchors, classes).
Actually, I don't quite get it. I thought the anchor boxes were defined by n_H and n_W (their height and width), and that the dimensions of each anchor box should be (p_c, b_x, b_y, b_h, b_w, classes).
I hope you could help me clarify this!
Thanks a lot!
The dimension of the anchor boxes mentioned here refers to the dimension of the training set. There are m images in the training set, each with height n_H and width n_W, and each entry belongs to a particular anchor (anchors) and a particular class (classes).
The dimension you are referring to with (pc, bx, by, bh, bw, classes) is the dimension of the output of the model.
There are three values one might consider dimensions related to anchor boxes. The first is the number of anchor boxes being used. The original YOLO paper predicted 2 boxes per grid cell; the car detection programming exercise uses 5 anchor boxes. The other two are the height and width of the anchor boxes themselves. The exercise includes a utility file, yolo_anchors.txt, that contains 10 values: a height and a width for each of the 5 anchor boxes. These values are completely independent of the shape of the input image and of the number of training images being used. In the original post above, anchors is 5.
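To make those three numbers concrete, here is a minimal sketch of how 10 comma-separated values can be read as 5 pairs, one pair per anchor box. The `parse_anchors` helper and the numbers in `example_text` are made up for illustration; they are not the contents of yolo_anchors.txt, and which value in each pair is the height and which is the width is left as an assumption.

```python
def parse_anchors(text):
    """Turn a comma-separated string of numbers into a list of 2-value pairs."""
    values = [float(v) for v in text.split(",")]
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values")
    # Group consecutive values two at a time: one pair per anchor box.
    return [(values[i], values[i + 1]) for i in range(0, len(values), 2)]


# Hypothetical placeholder values, NOT the real contents of yolo_anchors.txt.
example_text = "1.0, 1.5, 2.0, 2.5, 3.0, 5.0, 7.5, 3.5, 9.0, 9.5"
anchor_dims = parse_anchors(example_text)

# The count of pairs is the "anchors" entry in the encoding shape.
num_anchors = len(anchor_dims)
print(num_anchors)     # 5
print(anchor_dims[0])  # (1.0, 1.5)
```

Note that nothing here depends on the image size or on m: the count of pairs is the only part that shows up in the encoding shape.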
The statement in the assignment refers to the total collection of scores based on the anchor boxes:
"The dimension for anchor boxes is the second to last dimension in the encoding: (m, n_H, n_W, anchors, classes).
The YOLO architecture is: IMAGE (m, 608, 608, 3) → DEEP CNN → ENCODING (m, 19, 19, 5, 85)."
For clarity, it might have been better if the text had read something like "The dimension of the encoding tensor based on the anchor boxes is (m, n_H, n_W, anchors, classes)".
If the lower case encoding (m, n_H, n_W, anchors, classes) and the upper case ENCODING (m, 19, 19, 5, 85) are supposed to be referring to the same thing, then like the original poster I find them incongruent. The word classes in this exercise means 80, does it not? It should be (1 + 4 + classes).
Also, it's a little imprecise to state that (pc, bx, by, bh, bw, classes) is the dimension of the output of the model. Shouldn't that be S*S*B*(1+4+classes)?
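As a quick check of that arithmetic, here is a small sketch using the numbers quoted in this thread (a 19 x 19 grid, 5 anchor boxes, 80 classes). The variable names are my own and m is set to 1 arbitrarily; this only counts values and is not the assignment's code.

```python
m = 1             # number of images (arbitrary here)
S = 19            # grid cells per side (n_H = n_W = 19)
B = 5             # anchor boxes per grid cell
num_classes = 80  # number of object classes in the exercise

# Per box: 1 confidence score (pc) + 4 box values (bx, by, bh, bw)
# + one score per class.
values_per_box = 1 + 4 + num_classes

# Shape of the encoding tensor for m images.
encoding_shape = (m, S, S, B, values_per_box)

# Total number of values per image: S * S * B * (1 + 4 + classes).
values_per_image = S * S * B * values_per_box

print(encoding_shape)    # (1, 19, 19, 5, 85)
print(values_per_image)  # 153425
```

So the last dimension of ENCODING, 85, matches (1 + 4 + classes) with classes = 80 rather than classes alone.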
Yes, it is all imprecise. It seems to me that the author of the assignment was trying to say several things at once and squeezed them into one imprecise statement. Fortunately, this has not led to much confusion so far, as there has been only one question about this since the refresh. But I'll report it at the backend.