I don't quite understand this.

Let me describe what I understand so far.

For example, after the CNN has run: for each grid cell, work out the probability that it contains a car, then take the cells with the highest probability as center points. Is K-means then run for those cells? Are the width and height then determined from the clustering results? And is IoU computed after that?
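To make clear what I mean by the IoU step, here is a minimal sketch, assuming boxes are given as corner coordinates `(x_min, y_min, x_max, y_max)`. The function name `iou` is my own for illustration and is not taken from any particular library.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes.

    Assumption: each box is (x_min, y_min, x_max, y_max).
    """
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero so that non-overlapping boxes give zero intersection.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0
```

For two 2x2 boxes offset by one unit in each direction, the overlap is 1 and the union is 7, so the result is 1/7.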

Here is another idea of mine:

Does the CNN ultimately shrink the image down to a very small feature map, so that when it is scaled back up, the car's feature becomes the center point and the original image becomes the boundary?
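To show what I mean by scaling back up, here is a minimal sketch of mapping one cell of the small feature map to a center point in the original image. It assumes the final feature map is a square grid that evenly divides the input image; the function name `cell_center_in_image` and the example sizes (a 7x7 grid on a 448x448 image) are my own illustrative choices.

```python
def cell_center_in_image(row, col, grid_size, image_w, image_h):
    """Center of grid cell (row, col) in original-image pixel coordinates.

    Assumption: the feature map is grid_size x grid_size and each cell
    covers an equal-sized rectangle of the original image.
    """
    cell_w = image_w / grid_size
    cell_h = image_h / grid_size
    # The +0.5 moves from the cell's top-left corner to its middle.
    center_x = (col + 0.5) * cell_w
    center_y = (row + 0.5) * cell_h
    return center_x, center_y


# Middle cell of a 7x7 grid over a 448x448 image.
print(cell_center_in_image(3, 3, 7, 448, 448))
```

The predicted box would then have to stay inside the range 0 to `image_w` and 0 to `image_h`, which is what I mean by the original image being the boundary.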

Is my understanding correct?

What have I missed?