Week 3: Data Structure for Object Detection

I believe there are two unresolved issues with the lecture notes:

a) Number of classes: I believe you only need n-1 variables to represent n classes. Setting all n-1 variables to 0 can encode one arbitrarily chosen class.

b) Need for anchor boxes: why can’t we use a differential positioning scheme instead? The first object would be the one closest to (0,0), the second object the second closest to (0,0), and so on. You could define distance as x+y instead of the L2 norm.
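To make point (a) concrete, here is a minimal sketch of the two label encodings being compared. The function names are made up for illustration, and the choice of the last class as the all-zeros reference is arbitrary. It only shows that the labels can be encoded and decoded without loss; it says nothing about how a network trained on n-1 outputs would behave.

```python
def encode_one_hot(class_index, n_classes):
    """Standard encoding: n variables, exactly one of them set to 1."""
    return [1 if i == class_index else 0 for i in range(n_classes)]


def encode_reduced(class_index, n_classes):
    """Reduced encoding: n-1 variables. The last class (an arbitrary
    choice for this sketch) is represented by all zeros."""
    return [1 if i == class_index else 0 for i in range(n_classes - 1)]


def decode_reduced(vector):
    """Recover the class index; all zeros means the reference class."""
    return vector.index(1) if 1 in vector else len(vector)


if __name__ == "__main__":
    for c in range(3):
        print(c, encode_one_hot(c, 3), encode_reduced(c, 3))
```

Note that this relies on pc already saying whether an object is present, so an all-zeros class block is not confused with "no object".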

I’m not clear which part of the ‘lecture notes’ you’re referring to. But on this question, anchor boxes have shape (width and height) only. They are used to help scale the bounding box shape prediction, not its center location. This related thread might be of interest.

I was wondering: if anchor boxes are unwieldy, why not instead encode the vector in terms of distance relative to the origin?

The first block of pc, x, y, bw, bh, c1, c2, … would be assigned to the box closest to the origin (let’s say the bottom left corner of the image).

The second block of pc, x, y, bw, bh, c1, c2, … would be assigned to the box that is second closest to the origin, and so on, to build out the stacked vector while training.
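The ordering proposed here can be sketched as follows. This is only an illustration of the proposal, not anything from the course: the dictionary keys and the function name are hypothetical, and distance is taken as x + y as suggested earlier in the thread.

```python
def stack_by_distance(boxes):
    """Order boxes by x + y distance from the origin, then flatten them
    into one stacked vector of [pc, x, y, bw, bh, c1, c2, ...] blocks."""
    ordered = sorted(boxes, key=lambda b: b["x"] + b["y"])
    stacked = []
    for b in ordered:
        stacked.extend([1.0, b["x"], b["y"], b["bw"], b["bh"], *b["classes"]])
    return stacked


if __name__ == "__main__":
    boxes = [
        {"x": 0.7, "y": 0.6, "bw": 0.2, "bh": 0.3, "classes": [0, 1]},
        {"x": 0.1, "y": 0.2, "bw": 0.4, "bh": 0.5, "classes": [1, 0]},
    ]
    print(stack_by_distance(boxes))
```

One thing the sketch leaves open is ties: two boxes with the same x + y keep their input order here, so the scheme would need an explicit tie-breaking rule to be unique.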

Still very unclear to me what your thought process is. Again, anchor boxes have no location, so you can’t measure their distance from anything. Maybe you’re thinking of grid cells?

Regarding the shape, maybe you’re suggesting to flatten the entire output of the network, instead of using a 4D object? I agree the 4D output is unwieldy to think about or visualize, but it behaves quite nicely with Python vectorized mathematical operations. Maybe you can explain where you see the benefit of your proposed output shape? Is it more efficient computationally? Does it use less memory?
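As a rough illustration of how the 4D output and a flattened vector relate, the sketch below maps a 4D position to its place in a row-major flattened vector and back. The grid size (19 x 19) and the 80 class scores are assumptions made for this example; the five anchors match the five (width, height) pairs quoted in this thread. The function names are hypothetical.

```python
GRID_H, GRID_W = 19, 19   # assumed grid size
N_ANCHORS = 5             # five (width, height) anchor pairs
N_CHANNELS = 5 + 80       # pc, x, y, bw, bh plus an assumed 80 class scores


def flat_index(row, col, anchor, channel):
    """Position of element [row, col, anchor, channel] in the
    row-major flattened vector."""
    return ((row * GRID_W + col) * N_ANCHORS + anchor) * N_CHANNELS + channel


def unflatten_index(i):
    """Inverse of flat_index: recover (row, col, anchor, channel)."""
    i, channel = divmod(i, N_CHANNELS)
    i, anchor = divmod(i, N_ANCHORS)
    row, col = divmod(i, GRID_W)
    return row, col, anchor, channel


if __name__ == "__main__":
    i = flat_index(3, 7, 2, 10)
    print(i, unflatten_index(i))
```

Both layouts hold the same numbers, so the choice between them is about convenience of indexing rather than information content.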

I believe the anchor boxes uniquely determine where a box’s entries sit in the vector. We could use distances for that instead. Just as the probability of having two boxes of the same type in the same vector is assumed to be zero, we could make a similar assumption about distances and stack the vector accordingly.
The tensor representation is only used for programming; from what I understand, the mathematical representation still uses a stacked vector.

Edit: unwieldy in the sense that a preprocessing step is required to set up the training data. That step could be dropped if we used just distances instead of anchor boxes, and the mapping would still be unique.

I do not quite understand what you mean by “location of a box in a vector”.

Currently, the first stack of pc, x, y, bw, bh, c1, … belongs to anchor box 1.
Instead, the first stack would belong to the box closest to the origin.

Currently, the second stack of pc, x, y, bw, bh, c1, … belongs to anchor box 2.
Instead, the second stack would belong to the box second closest to the origin,
and so on.

So instead of using anchor boxes to decide the position of a box in the vector, I would use distances.

Except anchor boxes don’t have any location, so they can’t be ordered by distance.

Anchor boxes have only shape. Below are the values in yolo_anchors.txt used in the YOLO programming exercise for this class:

0.57273, 0.677385, 1.87446, 2.06253, 3.33843, 5.47434, 7.88282, 3.52778, 9.77052, 9.16828

Those are pairs of values representing shape. Anchor box 1 is (0.57273, 0.677385), Anchor box 2 is (1.87446, 2.06253), Anchor box 3 is (3.33843, 5.47434), etc. The numbers 1, 2, 3, etc. are just labels and imply no ranking. You can arrange the pairs of values in any order and not change the operation of the algorithm.
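To illustrate the pairing, the sketch below groups the ten values into (width, height) pairs and picks the anchor whose shape best matches a given box. Matching by shape-only IoU is one common way anchors are used when preparing labels; it is shown here as an assumption, not as the exact code of the exercise, and the function names are hypothetical.

```python
ANCHOR_VALUES = [0.57273, 0.677385, 1.87446, 2.06253, 3.33843,
                 5.47434, 7.88282, 3.52778, 9.77052, 9.16828]

# Consecutive values form (width, height) pairs.
anchors = list(zip(ANCHOR_VALUES[0::2], ANCHOR_VALUES[1::2]))


def shape_iou(a, b):
    """IoU of two boxes given only (width, height), as if both
    were centered on the same point."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union


def best_anchor(box_wh, anchor_list):
    """Return the anchor whose shape overlaps most with the box shape."""
    return max(anchor_list, key=lambda a: shape_iou(box_wh, a))


if __name__ == "__main__":
    print(anchors)
    print(best_anchor((2.0, 2.0), anchors))
    # Reordering the anchors does not change which shape is chosen.
    print(best_anchor((2.0, 2.0), list(reversed(anchors))))
```

Nothing in this matching uses a position, which is why the anchors cannot be sorted by distance from anything.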

That’s the point of my query: why not assign uniqueness through a method other than anchor boxes, like closeness to the origin? I am just trying to completely do away with the idea of anchor boxes while still preserving uniqueness of representation. It’s a proposed substitute for the data structure. In fact, I could rewrite the vector using differential distances.
So instead of x, y, I would rewrite the vector as delta_x, delta_y, where the deltas are computed from the previous box.

So for the closest box it would be 2,3, meaning 2,3 from the origin.
For the next closest box it would be -1,3, meaning a delta_x of -1 and a delta_y of 3 from the box above.
Next you would have -3,6, again meaning deltas from the previous box.
The constraint to follow would be delta_x + delta_y > 0, and one would need to figure out a distribution that enforces this condition.
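The delta scheme described above can be sketched like this. It is only an illustration of the proposal under the stated ordering (boxes sorted by x + y); the function names are hypothetical.

```python
def encode_deltas(centers):
    """Sort centers by x + y, then store each one as an offset
    from the previous center (the first is measured from the origin)."""
    ordered = sorted(centers, key=lambda c: c[0] + c[1])
    deltas = []
    prev = (0, 0)
    for x, y in ordered:
        deltas.append((x - prev[0], y - prev[1]))
        prev = (x, y)
    return deltas


def decode_deltas(deltas):
    """Recover absolute centers with a running sum of the offsets."""
    centers = []
    x, y = 0, 0
    for dx, dy in deltas:
        x, y = x + dx, y + dy
        centers.append((x, y))
    return centers


if __name__ == "__main__":
    deltas = [(2, 3), (-1, 3), (-3, 6)]  # the example values from the post
    print(decode_deltas(deltas))
    print(all(dx + dy > 0 for dx, dy in deltas))
```

Because the boxes are sorted by x + y, each delta_x + delta_y is non-negative by construction for the labels; the open question is how a network's predicted deltas would be made to respect that.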

If you think that will work better, then write the code, do the measurements, publish the paper. Redmon’s YOLO papers have over 20K academic citations. Let us know how it works out for you!

I am interested in a fundamental explanation, not one that relies on out-of-sample performance without conceptually pinning down the differences and the potentially desirable or undesirable properties. That sounds like “it just works, but I don’t know how.”

Secondly, just changing the data representation doesn’t make it a new model. It still uses the core aspects of the YOLO model, i.e.
a) breaking up an image into standardized sections, and
b) using IoU or a similar method to adjudicate redundancies.
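For reference, point (b) can be sketched as a plain IoU function plus a simple greedy suppression loop. This is a minimal illustration assuming corner-format boxes (x1, y1, x2, y2) and an arbitrary threshold of 0.5; it is not the course's implementation.

```python
def iou(box_a, box_b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy suppression: visit boxes from highest to lowest score and
    keep a box only if it does not overlap a kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 2, 2), (0, 0, 2, 2.2), (5, 5, 6, 6)]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))
```

This step works on predicted boxes after the fact, so it is independent of how the boxes were ordered in the output vector.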