I believe there are 2 unresolved issues with the lecture notes…
a) Number of classes: I believe you only need n-1 variables to represent n classes… setting all n-1 variables to 0 can encode one arbitrarily chosen class.
b) Need for anchor boxes: why can’t we use a differential positioning scheme instead? The first object would be the one closest to (0,0), the second object the second closest to (0,0), etc. You could define distance as x+y instead of the L2 definition.
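Point (a) can be sketched in a few lines. This is only an illustration of reference-class (n-1) coding, not anything taken from the lecture notes; the function names `encode_reference` and `decode_reference` are made up for the example, and the last class is arbitrarily chosen as the one represented by all zeros.

```python
import numpy as np

def encode_reference(label, n_classes):
    """Encode a class index with n_classes - 1 indicator variables.

    The last class is the arbitrarily chosen reference class and is
    represented by the all-zeros vector.
    """
    vec = np.zeros(n_classes - 1, dtype=int)
    if label < n_classes - 1:
        vec[label] = 1
    return vec

def decode_reference(vec):
    """Invert encode_reference: all zeros means the reference class."""
    vec = np.asarray(vec)
    return int(np.argmax(vec)) if vec.any() else len(vec)

n_classes = 4
codes = [encode_reference(k, n_classes) for k in range(n_classes)]
decoded = [decode_reference(c) for c in codes]  # round-trips every class
```

Note that the all-zeros code is unambiguous only if "no object present" is signalled by a separate variable; otherwise it cannot be told apart from the reference class.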
I’m not clear where in the ‘lecture notes’ you’re referring to. But on this question, anchor boxes have shape (width and height) only. They are used to help scale the bounding box shape prediction, not its center location. This related thread might be of interest.
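To make the "shape only" point concrete, here is a minimal sketch that assumes the YOLOv2-style box parameterization (center = sigmoid offset plus grid cell corner, size = anchor size times an exponential). The function name `decode_box` and the grid cell (3, 4) are hypothetical; the two anchor shapes are the ones quoted elsewhere in this thread.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(t, cell_xy, anchor_wh):
    """Decode raw outputs t = (t_x, t_y, t_w, t_h) into a box in grid-cell units.

    cell_xy   : (c_x, c_y), top-left corner of the grid cell
    anchor_wh : (p_w, p_h), anchor box shape
    """
    t_x, t_y, t_w, t_h = t
    c_x, c_y = cell_xy
    p_w, p_h = anchor_wh
    b_x = sigmoid(t_x) + c_x   # center depends on the grid cell only
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * np.exp(t_w)    # shape is a rescaling of the anchor
    b_h = p_h * np.exp(t_h)
    return b_x, b_y, b_w, b_h

t = (0.0, 0.0, 0.0, 0.0)
box_a = decode_box(t, cell_xy=(3, 4), anchor_wh=(0.57273, 0.677385))
box_b = decode_box(t, cell_xy=(3, 4), anchor_wh=(1.87446, 2.06253))
```

With the same raw outputs and the same grid cell, `box_a` and `box_b` share a center and differ only in width and height, which is the sense in which an anchor box carries no location.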
Still very unclear to me what your thought process is. Again, anchor boxes have no location, so you can’t measure their distance from anything. Maybe you’re thinking of grid cells?
Regarding the shape, maybe you’re suggesting flattening the entire output of the network instead of using a 4D object? I agree the 4D output is unwieldy to think about or visualize, but it behaves quite nicely with Python vectorized mathematical operations. Maybe you can explain where you see the benefit of your proposed output shape? Is it more efficient computationally? Does it use less memory, etc.?
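As a rough illustration of what "behaves nicely with vectorized operations" can mean, here is a sketch using NumPy. The output layout (19 x 19 grid, 5 anchors, 80 classes, with the confidence in position 0 and the class probabilities from position 5 onward) is an assumption made for the example, and the values are random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed layout: (grid_rows, grid_cols, anchors, 5 + classes),
# with each last-axis vector ordered as [p_c, 4 box values, c_1..c_n].
S, B, C = 19, 5, 80
output = rng.random((S, S, B, 5 + C))

box_confidence = output[..., 0:1]        # shape (19, 19, 5, 1)
class_probs = output[..., 5:]            # shape (19, 19, 5, 80)

# Broadcasting multiplies every class probability by its box confidence.
scores = box_confidence * class_probs    # shape (19, 19, 5, 80)
best_class = scores.argmax(axis=-1)      # shape (19, 19, 5)
best_score = scores.max(axis=-1)         # shape (19, 19, 5)

# The same numbers as one stacked vector per grid cell.
stacked = output.reshape(S, S, B * (5 + C))   # shape (19, 19, 425)
```

The reshape at the end shows that the 4D array and the per-cell stacked vector hold the same values; the extra axis just lets the per-anchor operations be written without explicit loops.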
I believe the anchor boxes uniquely encode the location of a box in a vector. We can use distances as a substitute for that: just as the probability of having 2 boxes of the same type in the same vector is assumed to be zero, we can make a similar assumption about distances and stack the vector accordingly…
The tensor representation is only used for programming; from what I understand, the mathematical representation still uses a stacked vector.
Edit: unwieldy in the sense that a preprocessing step is required to set up the training data, which could be done away with if we used just distances instead of anchor boxes. The mapping would still be unique.
Those are pairs of values representing shape. Anchor box 1 is (0.57273, 0.677385). Anchor box 2 is (1.87446, 2.06253). Anchor box 3 is (5.47434, 7.88282). Etc. The numbers 1, 2, 3, etc. are just nominal labels, not ordinal. You can arrange the pairs of values in any order and not change the operation of the algorithm.
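One way to see that only the shapes matter is to look at how a ground-truth box might be matched to an anchor. The sketch below assumes matching by IoU of width/height alone, with both boxes centered at the same point; that matching rule, the function name `shape_iou`, and the ground-truth shape (2.0, 2.5) are assumptions for illustration, not something stated in this thread.

```python
import numpy as np

def shape_iou(wh, anchors):
    """IoU between one (w, h) box and each anchor, with all boxes
    centered at the same point so only shape matters."""
    wh = np.asarray(wh, dtype=float)
    anchors = np.asarray(anchors, dtype=float)
    inter = np.minimum(wh[0], anchors[:, 0]) * np.minimum(wh[1], anchors[:, 1])
    union = wh[0] * wh[1] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

anchors = [(0.57273, 0.677385), (1.87446, 2.06253), (5.47434, 7.88282)]
truth_wh = (2.0, 2.5)   # hypothetical ground-truth box shape

best = int(np.argmax(shape_iou(truth_wh, anchors)))

# Listing the anchors in a different order picks the same shape.
shuffled = [anchors[2], anchors[0], anchors[1]]
best_shuffled = int(np.argmax(shape_iou(truth_wh, shuffled)))
```

Here `anchors[best]` and `shuffled[best_shuffled]` are the same (width, height) pair: the index changes with the ordering, the selected shape does not.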
That’s the point of my query: why not assign uniqueness through a method other than anchor boxes, like closeness to the origin? I am just trying to do away with the idea of anchor boxes completely while still preserving uniqueness of representation… It’s a proposed substitute for the data structure; in fact I could rewrite the vector using differential distances.
So instead of x,y, I would rewrite the vector as delta_x,delta_y, where the deltas are computed from the previous box.
So for the closest box it would be 2,3, which would mean 2,3 from the origin.
For the next closest box it would be -1,3, which would mean a delta_x of -1 and a delta_y of 3 from the box above.
Next you would have -3,6, which again would mean deltas from the previous box.
The constraint to follow would be delta_x+delta_y>0, and you would need to figure out a distribution that enforces this condition.
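The proposed encoding can be written out as a short sketch. The function names `encode_deltas` and `decode_deltas` are made up for the example, and the three centers are simply the ones implied by the deltas 2,3 / -1,3 / -3,6 above; nothing here says how a network would learn to produce such outputs.

```python
import numpy as np

def encode_deltas(centers):
    """Sort centers by x + y, then store each as an offset from the
    previous one. The first offset is measured from the origin (0, 0)."""
    centers = np.asarray(centers, dtype=float)
    order = np.argsort(centers.sum(axis=1), kind="stable")
    ordered = centers[order]
    return np.diff(ordered, axis=0, prepend=np.zeros((1, 2)))

def decode_deltas(deltas):
    """Recover the sorted centers by accumulating the offsets."""
    return np.cumsum(deltas, axis=0)

centers = [(1, 6), (-2, 12), (2, 3)]   # unordered input, chosen to match the post
deltas = encode_deltas(centers)        # rows: (2, 3), (-1, 3), (-3, 6)
recovered = decode_deltas(deltas)      # rows: (2, 3), (1, 6), (-2, 12)
```

Every row of `deltas` sums to a positive number, as the constraint requires. One caveat the sketch makes visible: two boxes with the same x+y give a delta that sums to exactly 0, so the strict inequality would need a tie-breaking rule.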
I am interested in a fundamental explanation… not an explanation that involves looking at out-of-sample performance without being able to conceptually pin down the differences and the potentially desirable or undesirable properties, i.e. one that sounds like "it just works but I don’t know how".
Secondly, just changing the data representation doesn’t mean it becomes a new model. It still uses the core aspects of the YOLO model, i.e.
a) the breaking up of an image into standardized sections, and
b) the use of IoU or any other similar method to adjudicate redundancies
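For reference, point (b) is usually done with non-max suppression. The following is a minimal greedy sketch, assuming boxes given as (x1, y1, x2, y2) corners and an IoU threshold of 0.5; the helper names, the threshold, and the example boxes are all illustrative choices rather than anything specified in this thread.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box against many; boxes are (x1, y1, x2, y2) corners."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop boxes that overlap it too much, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

In this example the second box overlaps the first heavily and is discarded, while the third box does not overlap either and is kept. This step operates on decoded boxes, so it is independent of how the boxes were represented inside the output vector.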