Hi,
I have several questions about how the YOLO algorithm works:
Question 1: In the course, Andrew mentions that image classification is applied to each of the individual “grid cells”. When an object spans multiple grid cells, how does the algorithm still identify the center point of the object and extend the bounding box beyond a particular grid cell? What actually happens when an object is spread across multiple grid cells?
Question 2: In the forward pass, the algorithm may detect multiple bounding boxes for a single object. Why does this happen? The forward pass happens only once per image, so how do we end up with multiple bounding-box predictions per image? I know they are eventually suppressed, but how do so many bounding boxes arise in the intermediate stages?
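For concreteness, here is how I currently picture the raw output of a single forward pass, assuming the 3×3 grid, 2 anchor boxes, and 3 classes from the lecture (these numbers are just my assumption), so please correct me if this is wrong:

```python
import numpy as np

# Assumed lecture-style setup: 3x3 grid, 2 anchor boxes per cell,
# 8 values per box = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3].
GRID, ANCHORS, VALUES = 3, 2, 8

# One forward pass of the ConvNet produces a single tensor of this shape:
y_pred = np.random.rand(GRID, GRID, ANCHORS, VALUES)  # stand-in for the network output

# Every (cell, anchor) slot is a candidate bounding box, so even one
# forward pass already yields 3 * 3 * 2 = 18 boxes before non-max suppression.
num_candidate_boxes = GRID * GRID * ANCHORS
print(num_candidate_boxes)  # 18
```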
Question 3: For YOLO training, can we feed images that contain multiple objects, or should each training image be cropped to a single class? How is a training image annotated? Would the label be a 3x3x2x8 tensor, or a 1x8 vector representing a specific class?
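To make Question 3 concrete, this is the label format I am guessing for one training image that contains two objects, again assuming a 3×3 grid, 2 anchors, and 3 classes (the positions, sizes, and classes below are made up):

```python
import numpy as np

# Assumed label layout per anchor slot: [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]
y_true = np.zeros((3, 3, 2, 8))  # one label tensor for the whole image

# Object 1: a car (class 2 of 3) whose center falls in grid cell (row 1, col 0).
# Only the cell containing the center is made responsible for it.
y_true[1, 0, 0] = [1, 0.4, 0.7, 1.6, 2.3, 0, 1, 0]  # b_h, b_w > 1: box spans several cells

# Object 2: a pedestrian (class 1) whose center falls in grid cell (row 2, col 2).
y_true[2, 2, 0] = [1, 0.5, 0.3, 0.9, 0.4, 1, 0, 0]

# Every other slot keeps p_c = 0, meaning "no object centered here";
# the remaining 7 values in those slots are "don't care" during training.
```

Is that roughly the right way to annotate a multi-object image, or does each object need its own 1x8 label?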
Thanks for pointing me to the detailed description!
A few things are clear, but I am still not able to connect all the dots!
The earlier sliding-window concept was very clear! But when we move on to YOLO, it is not clear how each grid cell can detect an object that is larger than the cell. (The representation of the grid and its corresponding outputs is clear, however.)
Also, regarding my Question 3: what would be the format of a single labelled training image?
(This works even if it results in values larger than the grid cell dimension; it is exactly the mechanism that allows YOLO to predict bounding box shapes larger than one grid cell.)
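If I understand that correctly, a small made-up example would look like the sketch below: one cell holds the center point, but its b_h and b_w values can exceed 1 cell, so the decoded box spills over neighbouring cells (a 3×3 grid over a 300×300 image is my assumption here).

```python
# Made-up example on a 3x3 grid over a 300x300 image, so each cell is 100x100 px.
cell_size = 100
row, col = 1, 2            # the cell that contains the object's center point

# Values predicted by that one cell, measured in units of the cell size:
b_x, b_y = 0.5, 0.5        # center sits in the middle of the cell (always between 0 and 1)
b_h, b_w = 1.2, 2.5        # height/width CAN exceed 1 -> box is larger than the cell

# Converting back to pixels shows the box spilling well beyond the 100x100 cell:
center_x = (col + b_x) * cell_size   # 250 px
center_y = (row + b_y) * cell_size   # 150 px
width    = b_w * cell_size           # 250 px wide, i.e. about 2.5 cells
height   = b_h * cell_size           # 120 px tall
print(center_x, center_y, width, height)
```

Is that the right way to read the b_h and b_w values?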