I’m having a hard time understanding Object Detection
Correct me if I’m wrong. Suppose we only have one class (car).
First, we split the picture into a 19 x 19 grid of cells.
Then we check the probability for each cell and get rid of any cell for which the probability is less than a chosen threshold.
Then we suddenly jump to these red boxes.
It is not clear to me how we get from the small grid cells to these big red boxes.
Where in the lectures does the first image come from? The second set of images showing how “non-max suppression” works is specific to how the YOLO algorithm works, which is the main object detection algorithm that Prof Ng describes, once he’s given the overview including earlier techniques like “sliding windows”.
There are lots of very detailed threads on the forum that explore how YOLO works and how it is trained. Here’s a good one to get started down that path. And here’s one that talks about non-max suppression and the role of anchor boxes (not the same as bounding boxes).
You need to be careful and precise about thoughts and assertions like this. It might be true for sliding windows, where you iteratively feed subsets of an overall image into a CNN, but it is not literally true for YOLO. The innovation of YOLO was that it does not split up the image, but processes the entire image as a single input in one pass.
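To make the "single pass" point concrete, here's a shape-only sketch in NumPy (the 608 x 608 input size, 5 anchors, and single class are illustrative assumptions on my part, not necessarily the exact numbers from the assignment):

```python
import numpy as np

# One whole image goes in; it is never cropped or tiled into 19x19 pieces.
image = np.zeros((1, 608, 608, 3))          # (batch, height, width, channels)

# The 19x19 "grid" is just the spatial shape of the network's OUTPUT.
# With 5 anchors and 1 class (car), every grid position predicts 5 boxes,
# each encoded as (p_c, b_x, b_y, b_w, b_h, c_1).
num_anchors, num_classes = 5, 1

def fake_yolo_forward(x):
    """Stand-in for the conv net: one pass, whole image -> grid of predictions."""
    batch = x.shape[0]
    return np.random.rand(batch, 19, 19, num_anchors, 5 + num_classes)

predictions = fake_yolo_forward(image)
print(predictions.shape)                    # (1, 19, 19, 5, 6)
```

The grid never exists as a preprocessing step; it only appears as the shape of the prediction tensor that falls out of the convolutional architecture.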
This is also not quite accurate. Grid cell and object bounding box are not the same concept. When you’re filtering on a confidence threshold, you are pruning low-confidence bounding box predictions, not grid cells. In YOLO, grid cells influence the network output shape and therefore the number of predictions being made, but the predictions themselves are of object bounding boxes. The grid cells are static, and the regions of the image they correspond to are never filtered or ‘gotten rid of’.
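A rough sketch of what that score-based filtering might look like (the tensor layout, function name, and 0.6 threshold here are my own illustrative assumptions, following the (p_c, b_x, b_y, b_w, b_h, classes) encoding from the course):

```python
import numpy as np

def filter_boxes(predictions, threshold=0.6):
    """Keep box predictions whose best class score exceeds `threshold`.

    `predictions`: array of shape (19, 19, num_anchors, 5 + num_classes),
    where the last axis is (p_c, b_x, b_y, b_w, b_h, class scores...).
    Only box *predictions* are pruned; the 19x19 grid itself is never touched.
    """
    p_c = predictions[..., 0]                    # objectness per detector
    class_probs = predictions[..., 5:]           # class scores per detector
    scores = p_c[..., np.newaxis] * class_probs  # score per (detector, class)
    best_scores = scores.max(axis=-1)            # best class score per detector
    keep = best_scores > threshold               # boolean mask over detectors
    return predictions[keep], best_scores[keep]  # flat arrays of survivors

preds = np.random.rand(19, 19, 5, 6)             # 5 anchors, 1 class (car)
boxes, scores = filter_boxes(preds)
print(boxes.shape, scores.shape)                 # e.g. (n_kept, 6) (n_kept,)
```

Notice the output is a flat list of surviving box predictions; nothing about the 19x19 structure is removed or modified.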
For the record, that image with colored cells is an example of image segmentation, where you are predicting the most likely class within a coarse boundary - the grid cell. That is not the same as predicting the bounding box of any specific object. For example, if there were two trees in a grid cell, that entire cell would be colored green, but it wouldn’t tell you anything about how many individual trees were there, or where within the cell the objects were located.
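To put the distinction in code with that two-trees example (purely hypothetical data): a segmentation-style grid stores one class per cell, while detection produces one entry per object:

```python
import numpy as np

# Segmentation-style: one class label per 19x19 grid cell.
# A cell containing two trees is just "tree"; counts and positions are lost.
cell_classes = np.zeros((19, 19), dtype=int)  # 0 = background
cell_classes[4, 7] = 2                        # 2 = "tree" (both trees!)

# Detection-style: one entry per *object*, regardless of which cell it's in.
detections = [
    {"class": "tree", "box": (310, 120, 355, 180), "confidence": 0.91},
    {"class": "tree", "box": (330, 125, 372, 178), "confidence": 0.84},
]
```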
Your intuition is correct; that would indeed be hard to do. Luckily, that isn’t what object detection algorithms try to do.
As described in considerable detail in the threads @paulinpaloalto has linked above, YOLO treats each grid cell + anchor box combination as a standalone detector capable of making its own predictions about object presence, class, and bounding box location and shape. Because of this, multiple detectors may make predictions on the same object in the image. Non-max suppression is one method for filtering out, or suppressing, the lower-confidence (non-maximum) predictions and keeping only the best one.
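Here is a minimal greedy NMS sketch to make that concrete (the box format, scores, and 0.5 IoU threshold are illustrative assumptions, not the assignment's exact code):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop any
    remaining box that overlaps it by more than `iou_threshold`."""
    order = np.argsort(scores)[::-1]  # best-first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(int(best))
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) <= iou_threshold])
    return keep

# Two detectors fire on the same car; NMS keeps only the better of the pair.
boxes = np.array([[100, 100, 200, 200],
                  [105, 102, 198, 205],
                  [400, 300, 480, 380]], dtype=float)
scores = np.array([0.9, 0.75, 0.8])
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The key intuition: the first two boxes overlap heavily (high IoU), so they are treated as duplicate detections of one object and only the higher-scoring one survives; the third box is a separate object and is kept.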
Hope this helps. Take a look at the related threads and let us know your thoughts.