I think it’s a reasonable question, since the most important step, training, is not included in this exercise. The problem is that YOLO has multiple versions and different implementations, so it is difficult to say how "it" is implemented in general. The best approach is to read the key papers, such as the v2 and v3 papers. There are several newer versions, but those were not done by the original developer.
This exercise is based on v2, but it is not identical. YOLO v2/v3 were implemented in C, while this exercise is basically Python, with some of the code ported to Keras.
I will try to explain the v2/v3 implementations as much as possible for your guidance, but eventually you may need to go back to the papers.
> But here in the object detection case, how do I get my ground truth Y based on the ground truth bounding boxes?
Of course, there is no ground truth at inference time, so let’s discuss training time.
The most important part of training is the loss function, i.e., how the network should be trained. Basically, the loss includes:
- Differences in bounding (anchor) box location and size
- Objectness (Pc)
- Object class
With this loss function, the network is trained to predict, for each anchor box, an objectness score and an object class that minimize the above losses. The implementation differs by YOLO version: newer versions set up the ground truth for training to be more anchor-box oriented, i.e., adding “anchor box”, “grid location”, etc., so that the loss can be calculated easily.
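As a rough illustration of those three terms, here is a minimal NumPy sketch with v2-style lambda weighting. All names, the tensor layout, and the plain squared-error form are my simplifications for illustration, not the exact exercise or paper code:

```python
import numpy as np

def simple_yolo_loss(y_true, y_pred, lambda_coord=5.0, lambda_noobj=0.5):
    """Toy sketch of the three YOLO loss terms.
    y_true, y_pred: (grid_h, grid_w, anchors, 4 + 1 + num_classes)
    Layout per anchor: [x, y, w, h, Pc, c1..cn] (illustrative)."""
    obj_mask = y_true[..., 4]          # 1 where an object was assigned to this anchor
    noobj_mask = 1.0 - obj_mask

    # 1) bounding (anchor) box location/size differences, only for "owner" anchors
    coord_loss = lambda_coord * np.sum(
        obj_mask[..., None] * (y_true[..., 0:4] - y_pred[..., 0:4]) ** 2)

    # 2) objectness (Pc): also penalize confident predictions where there is no object
    obj_loss = (np.sum(obj_mask * (y_true[..., 4] - y_pred[..., 4]) ** 2)
                + lambda_noobj * np.sum(noobj_mask * y_pred[..., 4] ** 2))

    # 3) object class, again only where an object exists
    class_loss = np.sum(obj_mask[..., None] * (y_true[..., 5:] - y_pred[..., 5:]) ** 2)

    return coord_loss + obj_loss + class_loss
```

Note how every term is gated by the objectness mask: grid cells with no assigned object only contribute the down-weighted "no-object" confidence penalty.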
> Do I need to assign the bounding box to the grid that has the bounding box’s centroid and then give each anchor the respective values (like if anchor 1 belongs to a car, anchor 2 belongs to a pedestrian and my image has a pedestrian, then anchor 1 has Pc → 0 and other values as don’t care whereas anchor 2 gets the Pc as 1, bounding box’s centroid that falls into the grid cell and the class)?
This paragraph is slightly difficult to follow, so let’s take it one point at a time. First, what we are talking about are “anchor boxes”, which are pre-defined boxes for object detection, not the ground-truth “bounding boxes”.
And, as you wrote, the centroid of the ground-truth bounding box determines which grid cell owns the object. In your case, you have two anchor boxes in each grid cell, so each may detect a different object, whichever best fits the shape of that anchor box, and the two are independent. So anchor 1 catches the car with its objectness and class number, and anchor 2 catches the pedestrian with its objectness and class number.
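If it helps, here is a toy sketch of that assignment step. The centroid picks the grid cell, and a shape-only IoU against the pre-defined anchors picks the responsible anchor (this is how v2/v3 match boxes to anchors; the function and variable names are mine):

```python
def assign_ground_truth(box, anchors, grid_size=7):
    """Toy sketch: pick the owner grid cell from the box centroid, and the
    anchor whose shape best matches the box (IoU of widths/heights only).
    box = (x, y, w, h), all values normalized to [0, 1]; names illustrative."""
    x, y, w, h = box
    col = min(int(x * grid_size), grid_size - 1)  # cell containing the centroid
    row = min(int(y * grid_size), grid_size - 1)

    # shape-only IoU against each pre-defined anchor (w, h), centers aligned
    best_anchor, best_iou = 0, 0.0
    for i, (aw, ah) in enumerate(anchors):
        inter = min(w, aw) * min(h, ah)
        union = w * h + aw * ah - inter
        iou = inter / union
        if iou > best_iou:
            best_anchor, best_iou = i, iou
    return row, col, best_anchor
```

Everything at that (row, col, best_anchor) slot in the ground-truth tensor gets Pc = 1, the box values, and the class; the other anchor in the same cell is left for other objects.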
The problem is that, in each grid cell, at most two objects can be detected in this setup. So newer versions use smaller grid cells (i.e., a larger number of grid cells), and also use outputs from multiple layers of the backbone network to cover small/mid/large objects. Here is an overview of the architecture.
(Source: Chen, Shi & Demachi, Kazuyuki. (2020). A Vision-Based Approach for Ensuring Proper Use of Personal Protective Equipment (PPE) in Decommissioning of Fukushima Daiichi Nuclear Power Station. Applied Sciences, 10, 5129. doi:10.3390/app10155129)
One thing I should add: in v2, the structure is simple; we have a backbone network (Darknet) and the YOLO head. From v3, one more stage is added, called the YOLO neck, which implements a Feature Pyramid Network. That is what the picture above shows.
V3 uses 9 anchor boxes (v2 uses 5), and 3 anchors are assigned to each output scale.
Back to your case: you have 7x7 grid cells and 2 anchor boxes for each. In your first definition, there are three classes. (This is a probability distribution, so in the case of c1, c2, c3, we have 3 classes.) So the output shape from the network should be (m, 7, 7, 2x(4+1+3)) = (m, 7, 7, 16). Then, in the YOLO head or YOLO eval step, you separate out the per-anchor information, e.g. reshaping to (m, 7, 7, 2, 8).
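That reshape is a one-liner; here is a small check with dummy data (the batch size and array contents are placeholders):

```python
import numpy as np

# Hypothetical raw network output for a batch of m images: (m, 7, 7, 16)
m = 2
raw = np.zeros((m, 7, 7, 16), dtype=np.float32)

# Separate the two anchors so each carries (x, y, w, h, Pc, c1, c2, c3)
per_anchor = raw.reshape(m, 7, 7, 2, 4 + 1 + 3)
print(per_anchor.shape)   # (2, 7, 7, 2, 8)
```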
If you look at the loss function for this exercise, it is more complex: it calculates IoU and does additional filtering based on Pc. But the key thing is that x, y, w, h, Pc, the class info, etc. all come from the network trained with the loss function.
Also, NMS and some other final processing steps are not attached to the network during training in v3. This keeps the focus on training the network based on the loss function.
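NMS itself is just a greedy post-processing loop, which is why it can live outside the trained network. A minimal sketch (corner-coordinate boxes; threshold and names are illustrative):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Toy greedy non-max suppression: keep the highest-scoring box, drop
    boxes that overlap it too much, and repeat. Returns kept indices."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)           # highest remaining score wins
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

Because this runs only at inference, it never affects the gradients during training.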
Hope this helps some.
> I want to try it out myself, so any kind of help would be highly appreciated.
That is a really good thing to do. V5 has a PyTorch version, and v3 has a Keras version. Please select the version that best fits your purpose and environment.