Hi, I struggled with questions similar to the ones you're asking, and I hope I can help you reach a satisfying intuition. But before I answer them I want to make sure we're on the same page about a few things; if you already know them, that's great!
Firstly, we need to understand that the “grid” only really exists as a labeling convention used during training, i.e. there is no explicitly defined grid anywhere in the actual convolutional network architecture of YOLO. The grid is just there for us humans, to provide a simple, consistent way to label objects. Here's how you can think about it.
Imagine you are an ML engineer who wants to teach a CNN to classify objects and predict bounding boxes. You know sliding windows and the like are computationally expensive, so you come up with the idea that one look, one forward pass of the CNN, should detect all objects at once. But how would you even train this network? You clearly need an output format that labels all objects with their bounding boxes as consistently as possible. So you decide it is a good idea to lay an s by s grid over the image and assign each object only to the grid cell that contains its midpoint. That saves you a lot of work. You also want a bounding box, so you make the cell output that as well.
And then you decide: well, since I am using an s by s grid, and each cell may or may not contain an object, my imagined one-look CNN must have an output of s by s by (things I want one cell to output), i.e. s by s by y, where y = 4 bounding box coordinates + number of classes + 1 for “is there anything here at all” (I am ignoring anchors and other complications for the moment).
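To make that labeling scheme concrete, here is a minimal sketch (my own illustrative code, not from the YOLO paper) of how such a training target of shape s by s by (5 + number of classes) might be built, assigning each object to the cell that contains its midpoint. The function name and the choices S=7, C=20 are just assumptions for the example.

```python
import numpy as np

def encode_target(boxes, labels, S=7, C=20):
    """Build an S x S x (5 + C) training target from ground-truth boxes.

    boxes:  list of (x, y, w, h) with coordinates normalized to [0, 1]
            relative to the whole image; (x, y) is the box midpoint.
    labels: list of integer class ids in [0, C).
    """
    target = np.zeros((S, S, 5 + C), dtype=np.float32)
    for (x, y, w, h), cls in zip(boxes, labels):
        col, row = int(x * S), int(y * S)          # cell that contains the midpoint
        cx, cy = x * S - col, y * S - row          # midpoint offset within that cell
        target[row, col, 0:4] = [cx, cy, w, h]     # box parameters
        target[row, col, 4] = 1.0                  # "is there anything at all"
        target[row, col, 5 + cls] = 1.0            # one-hot class
    return target

# one object whose midpoint falls in a single cell
t = encode_target([(0.52, 0.48, 0.30, 0.40)], [7])
print(t.shape)  # (7, 7, 25)
```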
And that's it! That's YOLO. Overly simplified, but that is the intuition. So to recap: the grid is just a scheme to label our training targets consistently and make them easy for the network to learn. As for the significance of the grid in the actual CNN of YOLO, it is simply that we end up with an output of s by s by (predictions per cell), so we can draw an analogy of the input image having been condensed into a grid of probability distributions.
Now coming to your question 1.
I can see why you would think that, to condense an image to an s by s grid, we would just use a stride equal to the filter size, but that is not at all what the model actually does. You are right that doing it that way would cause a loss of information, and it is necessary to handle those cases.
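To show the contrast, here is a rough sketch of how a convolutional stack condenses a 448 by 448 input down to a 7 by 7 grid of prediction vectors. The layer sizes below are illustrative, not the exact YOLO v1 configuration: the point is that the 3 by 3 convolutions overlap neighbouring positions at every stage, so information mixes across what will eventually become cell boundaries, instead of the image being chopped into independent 64 by 64 tiles.

```python
import torch
import torch.nn as nn

# A toy backbone (NOT the real YOLO v1 layers): 3x3 convolutions with
# stride-2 pooling, repeated until 448x448 has been condensed to 7x7.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 448 -> 224
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 224 -> 112
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 112 -> 56
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 56 -> 28
    nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 28 -> 14
    nn.Conv2d(256, 25, kernel_size=3, padding=1), nn.MaxPool2d(2),             # 14 -> 7
)

x = torch.randn(1, 3, 448, 448)
print(backbone(x).shape)  # torch.Size([1, 25, 7, 7]) -- an s x s grid of prediction vectors
```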
Coming to question 2:
If you read my intro paragraph, I hope you can see that each grid cell does not make a prediction based only on the pixels inside that cell, since the CNN is fed the WHOLE image; it is not the case that each cell analyzes only its own part of the image.
So even if an object only slightly appears in a cell, the CNN is not making its prediction by looking at just that portion. I encourage you to look up the “receptive field”. Basically (referring to the YOLO network above), the model uses a 7 by 7 grid output, so consider the output vector for, say, the 1st of the 7 by 7 grid cells: it holds the CNN's output for the possible detections/bounding boxes assigned to that cell, but its inputs trace all the way back to the whole image. Hence it is possible for this architecture to learn to predict bounding boxes even when only a very small part of an object appears in the cell.
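If you want to put a rough number on that, here is a small sketch computing the theoretical receptive field of a stack of convolutions and poolings. The layer stack is a toy one (my assumption, not the real YOLO v1 layers), and it ignores the fully connected layers YOLO v1 puts at the end, which connect every output to the entire final feature map; even so, one output cell already “sees” far more of the input than its own 64 by 64 patch.

```python
# Theoretical receptive field of a stack of (kernel, stride) layers.
# r = receptive field size in input pixels,
# j = distance ("jump") between neighbouring output positions in input pixels.
def receptive_field(layers):
    r, j = 1, 1
    for k, s in layers:
        r = r + (k - 1) * j
        j = j * s
    return r

# toy stack: six blocks of a 3x3 stride-1 conv followed by a 2x2 stride-2 pool
layers = [(3, 1), (2, 2)] * 6
print(receptive_field(layers))  # 190 -> each 7x7 output cell sees ~190x190 input pixels
```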
Although, it is very unlikely that such a cell will produce an output that survives non-max suppression, since there will usually be some other cell covering a greater portion of the object for which the CNN outputs a higher probability.
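For reference, non-max suppression itself is just a greedy filter on the final boxes. Here is a minimal sketch of the idea (plain Python of my own, not any particular library's implementation):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression: keep the highest-scoring box,
    drop boxes that overlap it too much, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# a confident cell and a weaker, heavily overlapping neighbour: only the first survives
boxes = [(50, 50, 150, 150), (60, 55, 155, 160)]
scores = [0.9, 0.3]
print(nms(boxes, scores))  # [0]
```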
Question 3:
The NN doesn't exactly predict midpoints, in the sense that they are not a separate output of YOLO; midpoints are used to construct the training labels. So as a human it makes sense to say that YOLO is in fact learning where midpoints might lie and giving those cells a higher probability output.
And if an object is partially covered, that's okay: YOLO isn't calculating midpoints, it's just a network that learns to make predictions, so it will naturally predict a close-enough midpoint. (At least, this is what we assume the network is doing.)
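As a final illustration, decoding runs in the other direction from the labeling sketch above: the network outputs a midpoint offset relative to its cell, and we map it back to image coordinates. Again, this is my own hedged sketch of the usual convention, not code from the paper.

```python
def decode_cell(row, col, pred, S=7):
    """Turn one cell's raw output (cx, cy, w, h, ...) back into an image-space box;
    (cx, cy) is the offset of the predicted midpoint within the cell."""
    cx, cy, w, h = pred[:4]
    x = (col + cx) / S          # midpoint, normalized to the whole image
    y = (row + cy) / S
    return x - w / 2, y - h / 2, x + w / 2, y + h / 2  # (x1, y1, x2, y2)

# cell (3, 3) predicts a midpoint 64% of the way across itself: still a valid box
# even if the true midpoint sits right at a cell border
print(decode_cell(3, 3, [0.64, 0.36, 0.30, 0.40]))
```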