In the videos, Andrew uses a 3x3 grid over the car image (the one with the mountain road and snow). With 3x3 the intuition works well because the car fits inside a single cell, but then he says that in practice a finer grid is used. If we overlay a 19x19 grid, each grid cell will contain only a tiny portion of the car. How can the network predict an accurate bounding box if it just reports the grid cell in which it detected the car, when the car actually spans many grid cells and each cell contains only a small part of the object?
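To put numbers on the concern, here is a rough sketch that counts how many grid cells a box touches at each grid size. The image size (608x608) and the car's box coordinates are hypothetical values I picked for illustration, not numbers from the videos, and `cells_overlapped` is just a helper name I made up.

```python
import math

def cells_overlapped(box, image_size, grid):
    """Count the grid cells touched by a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    cell_w = width / grid
    cell_h = height / grid
    # First and last column/row that the box reaches into.
    col_first = int(x_min // cell_w)
    col_last = min(grid - 1, math.ceil(x_max / cell_w) - 1)
    row_first = int(y_min // cell_h)
    row_last = min(grid - 1, math.ceil(y_max / cell_h) - 1)
    return (col_last - col_first + 1) * (row_last - row_first + 1)

# Hypothetical values, chosen only to illustrate the question.
IMAGE_SIZE = (608, 608)          # assumed image width and height in pixels
CAR_BOX = (230, 250, 380, 350)   # assumed car box: x_min, y_min, x_max, y_max

for grid in (3, 19):
    count = cells_overlapped(CAR_BOX, IMAGE_SIZE, grid)
    print(f"{grid}x{grid} grid: car touches {count} cell(s)")
# 3x3 grid: car touches 1 cell(s)
# 19x19 grid: car touches 20 cell(s)
```

With these assumed numbers the car sits entirely inside one cell of the 3x3 grid, but is spread over 20 cells of the 19x19 grid, which is the situation I am asking about.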
My understanding is that the network produces one output vector for each grid cell. Each output vector also encodes a bounding box, which in this case would contain only a small portion of the object.
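To make that understanding concrete, here is a minimal sketch of the output layout as I picture it: one vector per cell holding an object score, four box fields, and one score per class. The class count of 3, the single box per cell, and the names `CellOutput`, `values_per_cell` and `empty_output` are my own assumptions for illustration.

```python
from typing import List, NamedTuple, Tuple

class CellOutput(NamedTuple):
    pc: float                        # score that an object is assigned to this cell
    bx: float                        # bounding-box fields predicted by this cell
    by: float
    bh: float
    bw: float
    class_scores: Tuple[float, ...]  # one score per class

def values_per_cell(num_classes: int) -> int:
    # 1 object score + 4 box fields + one score per class
    return 5 + num_classes

def empty_output(grid: int, num_classes: int) -> List[List[CellOutput]]:
    """Build a grid x grid table of blank per-cell output vectors."""
    blank = CellOutput(0.0, 0.0, 0.0, 0.0, 0.0, (0.0,) * num_classes)
    return [[blank for _ in range(grid)] for _ in range(grid)]

NUM_CLASSES = 3  # assumed number of classes

for grid in (3, 19):
    output = empty_output(grid, NUM_CLASSES)
    vectors = len(output) * len(output[0])
    total = grid * grid * values_per_cell(NUM_CLASSES)
    print(f"{grid}x{grid}: {vectors} vectors, {total} values")
# 3x3: 9 vectors, 72 values
# 19x19: 361 vectors, 2888 values
```

So under these assumptions a 19x19 grid gives 361 separate vectors, each with its own box fields, and my question is how any one of them can describe the whole car.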