The loss function to use in a given case is always a choice that you need to make. Whether you use a distance-based loss (e.g. MSE or MAE) or a “cross entropy” loss depends on what your output represents: a continuous real number, or something like a probability distribution (the usual situation in “classification” problems).
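To make the distinction concrete, here is a minimal sketch in plain Python. The function names and the toy numbers are just for illustration, not taken from the course code.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: a distance-based loss for continuous outputs."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: another distance-based loss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(p_true, p_pred, eps=1e-12):
    """Cross entropy between a target distribution and a predicted one.

    eps guards against log(0) when a predicted probability is exactly zero.
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(p_true, p_pred))

# Regression-style output: real numbers compared by distance.
print(mse([1.0, 2.0], [1.0, 4.0]))
print(mae([1.0, 2.0], [1.0, 4.0]))

# Classification-style output: a one-hot label against predicted probabilities.
print(cross_entropy([0.0, 1.0, 0.0], [0.2, 0.5, 0.3]))
```

With a one-hot label, the cross entropy reduces to the negative log of the probability the model assigned to the correct class, which is why it fits outputs that look like probability distributions.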

The case of YOLO is interesting in that the model produces several different types of output, including bounding boxes and classifications (pedestrian, car, tree, stop light …). So you may need a “hybrid” approach to the cost function, in which you add a term for each aspect of the output. Maybe the bounding box needs something like MSE, whereas the classification of the contents needs something more like cross entropy. I think that is what Prof Ng was getting at in the section that you quote.
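Here is a rough sketch of what such a “hybrid” cost could look like for a single predicted object. To be clear, this is not the actual YOLO loss: the function names, the `box_weight` and `class_weight` parameters, and the box layout `(x, y, w, h)` are all assumptions made up for illustration.

```python
import math

def box_mse(box_true, box_pred):
    """Distance-based term for the bounding box coordinates."""
    return sum((t - p) ** 2 for t, p in zip(box_true, box_pred)) / len(box_true)

def class_cross_entropy(class_true, class_pred, eps=1e-12):
    """Entropy-based term for the class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(class_true, class_pred))

def hybrid_loss(box_true, box_pred, class_true, class_pred,
                box_weight=1.0, class_weight=1.0):
    """Weighted sum of one term per aspect of the output (hypothetical weights)."""
    return (box_weight * box_mse(box_true, box_pred)
            + class_weight * class_cross_entropy(class_true, class_pred))

# Toy example: box as (x, y, w, h), classes as (pedestrian, car, tree).
loss = hybrid_loss(
    box_true=[0.5, 0.5, 0.2, 0.2],
    box_pred=[0.5, 0.5, 0.2, 0.4],
    class_true=[0.0, 1.0, 0.0],
    class_pred=[0.25, 0.5, 0.25],
)
print(loss)
```

The weights are there because the two terms live on different scales, so you usually need some way to balance how much each one contributes to the total.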

Here’s another recent thread that talks in general about the differences between distance-style and entropy-style loss functions, though it is not specific to YOLO.