Questions about YOLO Algorithm

Sorting by number of occurrences in the training set might be one approach, but it isn’t what the YOLO inventors did; they ran a K Means clustering analysis on the training set bounding boxes. Since a K Means centroid is the mean of its cluster, I don’t believe you are guaranteed that any centroid matches an actual data set member. (Anchors were introduced in the second YOLO paper, from 2016, where they are called ‘priors’.)
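To make the centroid point concrete, here is a tiny sketch. The two box shapes are made up for illustration, and the `centroid` helper is my own name, not anything from the YOLO code.

```python
# Two hypothetical (width, height) box shapes that ended up in the same cluster.
cluster = [(2.0, 4.0), (4.0, 2.0)]

def centroid(boxes):
    """Component-wise mean of a list of (width, height) pairs."""
    n = len(boxes)
    mean_w = sum(w for w, _ in boxes) / n
    mean_h = sum(h for _, h in boxes) / n
    return (mean_w, mean_h)

c = centroid(cluster)
print(c)             # (3.0, 3.0)
print(c in cluster)  # False
```

The centroid is a 3 x 3 square even though no box in the cluster is square.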

It also seems like the simple sorting approach could suffer from imbalance in the training set. You could get good predictions on one family of similar shapes (say the top 5 occurrences all relate to nearby motor vehicles) but do poorly on others: classes with different aspect ratios, like traffic signs or humans, or less common sizes, such as the same class at a different distance. One can even imagine a data set in which only 5 shapes occur more than once, and those 5 are grossly different in size and shape from the vast majority of detection targets, each of which happens to have a unique shape. K Means with an IOU-based distance helps protect against both problems, because every box in the training set influences the centroids, not just the most frequent shapes.
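For anyone who wants to experiment, below is a minimal sketch of K Means over box shapes using 1 - IOU as the distance, which is the distance the second YOLO paper describes. Everything else is my assumption and not the authors' code: the function names `iou_wh`, `kmeans_iou` and `avg_iou`, the random initialization, the mean update rule (some implementations use the median), and the made-up imbalanced data set.

```python
import random

def iou_wh(a, b):
    """IOU of two (width, height) boxes aligned at a shared corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_iou(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs; minimizing 1 - IOU == maximizing IOU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)  # assumption: random initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the centroid it overlaps most.
            best = max(range(k), key=lambda i: iou_wh(box, centroids[i]))
            clusters[best].append(box)
        updated = []
        for i, members in enumerate(clusters):
            if members:
                n = len(members)
                updated.append((sum(w for w, _ in members) / n,
                                sum(h for _, h in members) / n))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids

def avg_iou(boxes, anchors):
    """Mean over boxes of the best IOU against any anchor."""
    return sum(max(iou_wh(b, a) for a in anchors) for b in boxes) / len(boxes)

# Hypothetical imbalanced data: many wide boxes, a few tall and a few small ones.
boxes = [(4.0, 2.0)] * 50 + [(1.0, 3.0)] * 5 + [(0.5, 0.5)] * 5
anchors = kmeans_iou(boxes, k=3)
print(anchors, avg_iou(boxes, anchors))
```

Comparing `avg_iou` for these anchors against `avg_iou` for the k most frequent shapes in your own data is one way to see the difference quantitatively. Results depend on the initialization, so it is worth trying several seeds.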

EDIT - not sure that anyone reads these old threads, but if you’re here and want to understand more about how the YOLO inventors decided on which anchor boxes to use, and quantitatively why their approach is superior to just selecting the most common shapes in the training data, take a look here…[Deriving YOLO anchor boxes]

Like most things in machine learning, I don’t think there is a single, simple, universally applicable answer. It requires engineering tradeoffs that depend on the data set, the runtime environment, and the business problem/domain. The ‘correct’ answer for a self-driving vehicle probably won’t be the same as for subject identification on a mobile device camera, etc.