Maybe you can elaborate on why your design approach would be better. More accurate? Faster to run? Both? Something else?
There are two contributions that anchor boxes make to the effectiveness of YOLO.
First is that anchor boxes provide reasonable baselines for bounding box shape/size predictions. As stated in the papers, the predicted bounding box width and height are related to the anchor box shape as b_w = p_w * e^{t_w} and b_h = p_h * e^{t_h}, where t_w and t_h are outputs of the neural network and (p_w, p_h) is the shape of an anchor box. As detailed in other threads, the anchor box shapes (p_w, p_h) are based on the shapes of objects in the training data, not on the classes/types of those objects.
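To make that concrete, here is a tiny sketch of that decoding step. The function name and the anchor values 3.6 x 2.4 are just mine for illustration, not from the course notebook:

```python
import math

# A minimal sketch of the box shape decoding described above. t_w and t_h are
# raw network outputs; (p_w, p_h) is one anchor box shape taken from the
# training data (the specific numbers below are hypothetical).
def decode_box_shape(t_w, t_h, p_w, p_h):
    b_w = p_w * math.exp(t_w)   # b_w = p_w * e^{t_w}
    b_h = p_h * math.exp(t_h)   # b_h = p_h * e^{t_h}
    return b_w, b_h

# With t_w = t_h = 0 the prediction is exactly the anchor shape, which is why
# anchors act as a sensible baseline for shape/size predictions.
print(decode_box_shape(0.0, 0.0, p_w=3.6, p_h=2.4))   # (3.6, 2.4)
```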
Second is that they help the network make multiple object detection predictions on each input image. As mentioned above and in the self-driving car programming exercise, the YOLO network makes S*S*B detections on each image from each forward pass, and each detection is (1 + 4 + C) floating point numbers: the 1 is the object presence prediction, the 4 are the bounding box center location and shape predictions, and the C values are the class probabilities.
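In code terms, you can picture the raw output as a tensor sliced like this. The variable names are mine, and S = 19 is just an assumed grid size since only B and C are quoted above:

```python
import numpy as np

# Sketch of how the raw output volume is typically interpreted, assuming a
# 19x19 grid, B = 5 anchors per cell, and C = 80 classes.
S, B, C = 19, 5, 80
raw = np.zeros((S, S, B, 1 + 4 + C))   # S*S*B detections, 85 numbers each

p_obj      = raw[..., 0]      # object presence score,   shape (S, S, B)
box        = raw[..., 1:5]    # center x, y and w, h,    shape (S, S, B, 4)
class_prob = raw[..., 5:]     # class probabilities,     shape (S, S, B, C)
print(raw.size)               # 19 * 19 * 5 * 85 = 153425 numbers per image
```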
Doing away with anchor boxes entirely would remove the positive impact they have on convergence, because the box shape predictions would effectively start from random initialization instead of from the common shapes found in the training data. Further, if you changed from the number of anchor boxes (generally an integer less than 10) to the number of classes, it would substantially increase the number of predictions you would have to make on each forward pass. By that I mean the course uses B = 5 and C = 80, so doing a full detection for each class would increase the network output size by 16x. But it is common to train on 1,000 classes, and ImageNet actually contains some 20,000 of what they call categories, which correspond to what we call classes. The original YOLO took a week to train, so I'm not sure how practical it would be with B = 20,000.
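Here is the back-of-the-envelope arithmetic behind that 16x figure. This is my sketch, again assuming a 19x19 grid and keeping the per-detection vector at 1 + 4 + C:

```python
# Compare B = 5 anchor boxes against one detection slot per class (B = C).
S, C = 19, 80
with_anchors  = S * S * 5 * (1 + 4 + C)    # B = 5 anchor boxes
one_per_class = S * S * C * (1 + 4 + C)    # one detection slot per class
print(with_anchors, one_per_class, one_per_class // with_anchors)
# 153425 2454800 16  -- and with 20,000 slots the output grows 4,000x instead
```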
Finally, the bounding box center location and shape predictions are not dependent on the class prediction. They are entirely separate outputs of the network. So how and why would the network produce different bounding box center location and shape predictions for a given object? Wouldn’t they be the same for all possible classes? The features extracted from a certain region of the image suggest where an object is regardless of what it might be.
Sorry, but I don’t yet see the upside of this proposed design.