I’m watching this video, Non-max Suppression, and have a question about an example presented in the video. In this example, the detection area is divided into a 19x19 grid. We are trying to detect objects in each cell, but I noticed that each cell (highlighted in green and yellow) is significantly smaller than the car being detected. Given that the training examples are presumably entire cars, how does the algorithm accurately detect parts of a car (like a door, a window, or a wheel) within these small cells? Specifically, how does it recognize that these individual components should be labeled as a car?
I’m trying to understand the connection between the small-scale detection in each grid cell and the labeling process, given that training is done on whole objects. Any insights or explanations would be greatly appreciated!
The detection of objects happens independently of the grid cells, and there is no requirement that an object be contained in a grid cell. The grid cells are just used to organize the output: a given detected object is attached to the grid cell that contains its centroid. The network is trained to detect whole objects, and that is driven (as in all “supervised learning” cases) by how the input training data is labeled.
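To make the centroid idea concrete, here is a minimal sketch of how a labeled box for a whole object gets assigned to exactly one grid cell when building a training target. This is not the course’s actual code; the grid size, box format, class count, and the `encode_label` helper are all assumptions chosen for illustration.

```python
import numpy as np

GRID = 19          # 19x19 grid from the lecture example (assumed)
NUM_CLASSES = 3    # e.g. car, pedestrian, motorcycle (assumed)

def encode_label(box, class_id, img_w, img_h):
    """box = (x_min, y_min, x_max, y_max) in pixels for one whole object."""
    # Target layout per cell: [p_c, bx, by, bw, bh, class one-hot]
    target = np.zeros((GRID, GRID, 5 + NUM_CLASSES))

    # Centroid of the whole object, normalized to [0, 1]
    cx = (box[0] + box[2]) / 2.0 / img_w
    cy = (box[1] + box[3]) / 2.0 / img_h

    # The single grid cell responsible for this object
    col = int(cx * GRID)
    row = int(cy * GRID)

    target[row, col, 0] = 1.0                            # objectness
    target[row, col, 1:3] = [cx * GRID - col,             # centroid offset
                             cy * GRID - row]             # inside that cell
    target[row, col, 3:5] = [(box[2] - box[0]) / img_w,   # width and height
                             (box[3] - box[1]) / img_h]   # relative to image
    target[row, col, 5 + class_id] = 1.0                  # one-hot class
    return target

# A car spanning many cells is still encoded in just one cell:
# the one containing its center point.
y = encode_label((120, 200, 480, 380), class_id=0, img_w=608, img_h=608)
print(np.argwhere(y[..., 0] == 1))   # exactly one (row, col) pair
```

Note that the box width and height can be (and often are) much larger than one cell; the cell only anchors the prediction, it does not bound the object.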
There are a number of threads on the forums that go into quite a bit more depth on how YOLO works and is trained than we get in the lectures or the assignments. Here’s a good one to start with that discusses how the training works.
Unfortunately, the language used by Prof Ng here is not a precise description of what the YOLO algorithm does. At runtime, YOLO inputs the entire image once, runs forward propagation once, and outputs one matrix containing all the predictions for all of the grid cells. Prof Ng refers to this explicitly at 3:43 of the YOLO algorithm video. YOLO does not run a CNN forward propagation per grid cell, as might reasonably be inferred from the transcript excerpts in this thread.
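Here is a toy sketch of that point: a single forward pass over the whole image yields a 19x19 grid of predictions in one output tensor. The 608x608 input size, 5 anchors, 80 classes, and the made-up five-stage backbone are all assumptions for illustration, not the actual YOLO architecture.

```python
import tensorflow as tf

B, C = 5, 80                       # anchors per cell, class count (assumed)
inputs = tf.keras.Input(shape=(608, 608, 3))
x = inputs
for filters in (32, 64, 128, 256, 512):   # five stride-2 stages: 608 -> 19
    x = tf.keras.layers.Conv2D(filters, 3, strides=2, padding="same",
                               activation="relu")(x)
outputs = tf.keras.layers.Conv2D(B * (5 + C), 1)(x)  # raw predictions per cell
model = tf.keras.Model(inputs, outputs)

image = tf.random.uniform((1, 608, 608, 3))
pred = model(image)                # ONE forward pass over the whole image
print(pred.shape)                  # (1, 19, 19, 425): all 19x19 cells at once
```

The grid structure comes from the spatial shape of the output volume, not from cropping the image into cells and running the network on each crop.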
Andrew tends to lecture using broad intuitions that easily convey the concepts. He often omits (or simplifies) many of the specific details, since he cannot know how much prior experience the audience has.
I agree. The language used in these videos conceptually kind of straddles the boundary between convolutional sliding windows, introduced previously, and YOLO, introduced subsequently, without explicit reference to either.
Another challenge is that these lectures are a snapshot in time. They might describe a version of an algorithm that was current when the video was recorded, but not reflect current practice or even the latest state of the related programming exercises. I think these videos are from 2017-ish.