I was wondering: Prof. Ng mentioned, when talking about grid cells, that we apply a classification and localization algorithm to each cell.
Does that mean the labels are automatically generated for each cell? If so, I think the algorithm won't classify and localize well with a finer grid. Or did I get something wrong?
Thanks in advance.
Dear @MustafaaShebl ,
In classification and localization, the output is divided into cells. Please note that the number of cells is fixed, both during training and inference.
- Labels: Not auto-generated; they come from ground-truth annotations (manual or semi-automatic) and are mapped to grid cells for training. Our model doesn't create them; it only gives us predictions (a.k.a. y hat).
- Finer grids: May or may not improve detection. For example, if the objects are mostly small, a finer grid is more helpful; if the objects are mostly large, a finer grid may not work well.
@MustafaaShebl Yes, making the grid too fine can cause inefficiencies. That is why detection models use techniques like anchor boxes, Non-Maximum Suppression (NMS), and multi-scale feature maps to improve accuracy. If you watch the videos, Prof. Andrew Ng talks about these techniques.
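To make the NMS idea concrete, here is a minimal sketch (not the course's reference implementation): keep the highest-scoring box, discard any remaining box that overlaps it above an IoU threshold, and repeat. The helper names `iou` and `nms` are my own.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, threshold=0.5):
    """Return indices of boxes kept after non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining score wins
        keep.append(best)
        # drop boxes that overlap the winner too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the two near-duplicate boxes collapse to one
```

Anchor boxes and multi-scale feature maps address a different problem (multiple or variously sized objects per cell); NMS only cleans up duplicate detections of the same object.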
Thanks for your fast response.
Can you please clarify the "we apply classification and localization algorithm on each grid cell" part?
What I understand is that we choose the grid size according to the application and the objects to detect, and that we ourselves create the bounding boxes and therefore the target label Y.
I think this is a correct summary. It's an engineering tradeoff. More grid cells means the capability to detect more objects per image. But since the entire prediction vector is computed for each grid cell plus anchor box, computation and memory scale quadratically with the grid dimension, meaning there is a business-driven practical upper bound.
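A quick back-of-the-envelope check of that quadratic scaling, assuming a standard YOLO-style output of (p_c, bx, by, bw, bh) plus K class scores per cell and anchor (the function name and the anchor/class counts are illustrative):

```python
def output_elements(S, anchors=2, num_classes=20):
    """Number of values in an S x S grid's prediction tensor:
    each cell/anchor predicts 5 box terms + num_classes scores."""
    return S * S * anchors * (5 + num_classes)

for S in (3, 7, 19):
    print(S, output_elements(S))
# doubling S quadruples the prediction count, e.g. S=3 -> 450, S=6 -> 1800
```

So going from a 7x7 grid to a 19x19 grid multiplies the output size (and the matching target-label storage) by roughly 7x, before any anchors or classes are added.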
The ground truth labels are initially associated with the training input, X. They are often provided in a text file, which may be XML or JSON. For YOLO training, these labels must be mapped in a preprocessing step to a matrix sharing the network output shape (what you refer to as the target label Y above). During training, the ground truth labels are iteratively compared with the network's predicted output, \hat{Y}.
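That preprocessing step can be sketched roughly like this: for each annotated box, find the grid cell containing its center and write the label into that cell's slot of the target tensor. This is a simplified single-box, no-anchor sketch with made-up defaults (3x3 grid, 3 classes), not the course's exact pipeline.

```python
import numpy as np

def boxes_to_target(boxes, grid_size=3, num_classes=3):
    """Map ground-truth boxes to a YOLO-style target tensor Y.

    boxes: list of (class_id, cx, cy, w, h), all coordinates
    normalized to [0, 1] relative to the whole image.
    Per cell: [p_c, bx, by, bw, bh, c1..cK] (one box, no anchors).
    """
    Y = np.zeros((grid_size, grid_size, 5 + num_classes))
    for class_id, cx, cy, w, h in boxes:
        col = int(cx * grid_size)       # cell column holding the box center
        row = int(cy * grid_size)       # cell row holding the box center
        bx = cx * grid_size - col       # center x relative to the cell
        by = cy * grid_size - row       # center y relative to the cell
        Y[row, col, 0] = 1.0            # object present in this cell
        Y[row, col, 1:5] = [bx, by, w, h]
        Y[row, col, 5 + class_id] = 1.0  # one-hot class
    return Y

# one object of class 1 centered at (0.45, 0.62), width 0.30, height 0.20
Y = boxes_to_target([(1, 0.45, 0.62, 0.30, 0.20)])
print(Y[1, 1])  # only the cell containing the center is non-zero
```

The loss is then computed between this Y and the network's \hat{Y} of the same shape, which is why the grid size must stay fixed between training and inference.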
There is some related discussion here: Week 3: finding the correct cell in YOLO
Thank you so much. So the ground truth labels are given; we don't use some algorithm, like a classification and localization algorithm, to generate the labels for each cell. And the XML or JSON files provided by, say, labelers are preprocessed and mapped to the ground truth label / network output shape.
Did I get it right?