Suppose we have 3 classes: pedestrian, car and motorcycle. Does it make sense to pick 3 corresponding anchor boxes, or not necessarily? For example, one narrow, tall anchor box to ‘catch’ the pedestrian, a square one to ‘catch’ the motorcycle, and a wide rectangle to ‘catch’ the car?
Or should we rather choose a larger number of anchor boxes?
Is there any relation at all between the number of classes and the number of anchor boxes?
Also, must anchor boxes be made of straight lines? Or could they be circular, for example?
Hi Doron_Modan,
You will want to find the shape of the bounding box that best detects the classes you want to detect. Have a look at this blogpost. This can include changing the shape of the bounding box, e.g. to a circle, which is then called a bounding circle. See, e.g., this article about medical object detection.
I disagree. In my experience you want to find the shapes of bounding boxes that best represent the shapes of the objects in your training data, not the classes. Anchor shape is primarily about good localization, not classification. If you have a lot of nearby cars and a lot of far away cars in your data, you likely need at least 2 anchor box shapes to localize them, not one for all cars. Further, if the objects in your training data don’t represent the objects you want to predict, you have another, different, problem.
The YOLO designers used K-means to derive good anchor box shapes. There is a thread about it in this forum here: Deriving YOLO anchor boxes
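For anyone who wants to try this, here is a minimal sketch of K-means over box shapes. It assumes the training boxes are given as an (N, 2) NumPy array of (width, height) values and uses 1 - IoU as the distance, with all boxes aligned at a common corner so that only shape matters. The names `shape_iou` and `kmeans_anchors` are made up for this illustration and are not from any YOLO code base.

```python
import numpy as np

def shape_iou(boxes, anchors):
    """IoU between (w, h) pairs, with every box aligned at the same corner.

    boxes: (N, 2) array, anchors: (K, 2) array. Returns an (N, K) array.
    """
    inter_w = np.minimum(boxes[:, None, 0], anchors[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    inter = inter_w * inter_h
    area_boxes = boxes[:, 0] * boxes[:, 1]
    area_anchors = anchors[:, 0] * anchors[:, 1]
    return inter / (area_boxes[:, None] + area_anchors[None, :] - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (w, h) pairs into k anchor shapes using 1 - IoU as distance."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct training boxes picked at random.
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].copy()
    for _ in range(iters):
        # Highest IoU is the same as smallest (1 - IoU) distance.
        assign = np.argmax(shape_iou(boxes, anchors), axis=1)
        new_anchors = anchors.copy()
        for j in range(k):
            members = boxes[assign == j]
            if len(members):  # keep the old anchor if a cluster is empty
                new_anchors[j] = members.mean(axis=0)
        if np.allclose(new_anchors, anchors):
            break
        anchors = new_anchors
    return anchors
```

With data that mixes small (far away) and large (nearby) boxes of the same class, `kmeans_anchors(boxes, k=2)` should return one small and one large anchor, which is the point made above about localization rather than class.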
Indeed, I should have written “bounding boxes”. This is exemplified in the first link in my previous post.
The number of anchor boxes in a YOLO architecture directly impacts the shape of the network output and the amount of computation. It is unlikely that one could afford the memory or compute time to support a network with enough anchor boxes to assign one per class for ImageNet, for example, which contains 1,000 classes.
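To put rough numbers on that, here is a small sketch. It assumes a YOLOv2-style output layout in which every grid cell predicts, for each anchor, 4 box coordinates, 1 objectness score and one score per class. The 19 x 19 grid and the function names are assumptions chosen for illustration only.

```python
def yolo_output_shape(grid_h, grid_w, num_anchors, num_classes):
    """Output tensor shape, assuming 4 box coords + 1 objectness + class scores per anchor."""
    return (grid_h, grid_w, num_anchors, 5 + num_classes)

def num_output_values(grid_h, grid_w, num_anchors, num_classes):
    """Total number of predicted values per image under the same assumption."""
    h, w, a, d = yolo_output_shape(grid_h, grid_w, num_anchors, num_classes)
    return h * w * a * d

# 5 anchors, 80 classes on an assumed 19 x 19 grid
print(num_output_values(19, 19, 5, 80))        # 153425
# 8 anchors, 1,000 classes
print(num_output_values(19, 19, 8, 1000))      # 2902440
# one anchor per class for 1,000 classes
print(num_output_values(19, 19, 1000, 1000))   # 362805000
```

The output grows linearly with the number of anchors, so going from 8 anchors to one anchor per class at 1,000 classes multiplies the output size by 125 under these assumptions.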
I found in my own analyses that there were steadily diminishing returns in accuracy from increasing the number of anchor boxes, and the best cost/benefit tradeoff of accuracy vs compute was at around 8 anchor boxes. That number will vary depending on the data set. If you have few classes but objects at varying distances from the camera, you probably want more anchor boxes than classes. If you have lots of classes, you will have significantly fewer anchor boxes than classes, even 2 orders of magnitude fewer. Hope this helps.
PS: note that it’s not just aspect ratio (‘taller than wide’) that matters for anchor box shape, but the actual size in pixels. Anchor boxes that aren’t close in shape to the training objects cause problems in training.
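A quick way to see why size matters as much as aspect ratio: compare the IoU of two shapes aligned at the same corner. This is only a sketch, `wh_iou` is a made-up helper name, and the pixel sizes are invented for illustration.

```python
def wh_iou(wh_a, wh_b):
    """IoU of two (width, height) shapes aligned at the same corner."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union

far_car = (40, 20)     # hypothetical far away car, in pixels
near_car = (200, 100)  # hypothetical nearby car, same 2:1 aspect ratio

print(wh_iou(far_car, near_car))   # about 0.04
print(wh_iou(far_car, (44, 22)))   # about 0.83
```

Both cars have the same aspect ratio, yet an anchor sized for the nearby car overlaps the far away one by only about 4%, while an anchor of similar pixel size overlaps it well.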
Also note that at least in my quick first read, the article linked above talks about the size and shape of anchor boxes, not a ‘face’ anchor box.