I haven’t yet had time to read the full article that Reinoud has given us, but it looks excellent and gives detailed explanations of the differences between the two approaches.
But the short answer is that in the object detection case, you’re not trying to reconstruct a labelled version of the input image, right? That is what requires the skip connections in the Semantic Segmentation case. In Object Detection and Localization the location information is handled differently: it is expressed by the bounding boxes that are part of the output. That’s a fundamentally different approach. YOLO is very deep water. If you want to understand it in more detail, you should go through some of the excellent explanations put together by ai_curious on these forums. Start with this post and the earlier ones that it links to.
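To make that contrast concrete, here is a quick sketch in plain NumPy with made-up sizes (a 416 x 416 input, a 13 x 13 grid, 3 boxes per cell, 3 classes; none of those numbers come from the posts above) showing how differently the two kinds of output are shaped:

```python
import numpy as np

H, W, NUM_CLASSES = 416, 416, 3

# Semantic segmentation: the network predicts a label for every pixel, so the
# output has the same spatial resolution as the input. Recovering that full
# resolution is what the U-Net style skip connections help with.
segmentation_output = np.zeros((H, W, NUM_CLASSES))       # per-pixel class scores

# Object detection (YOLO-style): the image is divided into a coarse grid and
# each cell predicts a fixed number of boxes. Location is carried by the
# (x, y, w, h) numbers in each prediction, not by pixel positions.
GRID, BOXES_PER_CELL = 13, 3
# each box prediction: [x, y, w, h, objectness, class scores...]
detection_output = np.zeros((GRID, GRID, BOXES_PER_CELL, 5 + NUM_CLASSES))

print(segmentation_output.shape)   # (416, 416, 3)  -> dense, pixel-level
print(detection_output.shape)      # (13, 13, 3, 8) -> sparse, box-level
```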
Who says it isn’t? Joseph Redmon, for example, writes …
Our new network is a hybrid approach between the network used in YOLOv2, Darknet-19, and that newfangled residual network stuff. Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well…
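Just to illustrate what that looks like, here is a rough sketch of a 1 x 1 / 3 x 3 block with a shortcut connection, written with Keras layers. The filter counts and input size are made up, and this is not the actual Darknet-53 code, just the general pattern the quote describes:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Sketch of a Darknet-style block: 1x1 then 3x3 convolution,
    added back onto the input (the 'shortcut connection')."""
    shortcut = x
    y = layers.Conv2D(filters // 2, 1, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU(0.1)(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.LeakyReLU(0.1)(y)
    return layers.Add()([shortcut, y])

inputs = tf.keras.Input(shape=(256, 256, 64))
outputs = residual_block(inputs, 64)
model = tf.keras.Model(inputs, outputs)
model.summary()
```

The Add() at the end is the shortcut: the block learns a residual on top of its input, which is what lets a network like this go much deeper than Darknet-19 without the training problems that plain stacked convolutions run into.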
One observation I’d make is that semantic segmentation only supports a single classification per pixel, whereas YOLO object detection allows multiple objects of different shapes to overlap: the classic example is a person standing in front of a car. Depending on your use case, that may or may not be significant.
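Here is a toy illustration of that point, with made-up coordinates and class ids: in a segmentation mask the person’s pixels overwrite the car’s, while in box-based detection the two objects simply overlap and both are still reported.

```python
import numpy as np

# Toy example (made-up coordinates): a person standing in front of a car.
PERSON, CAR = 1, 2

# Segmentation mask: each pixel holds exactly one class id, so where the
# person occludes the car, the car disappears from the mask.
mask = np.zeros((100, 100), dtype=int)
mask[20:90, 30:80] = CAR          # car region
mask[10:95, 45:60] = PERSON       # person overwrites the car pixels it covers

# Box-based detection: the two objects coexist, their boxes just overlap.
boxes = [
    {"cls": "car",    "box": (30, 20, 80, 90)},   # (x1, y1, x2, y2)
    {"cls": "person", "box": (45, 10, 60, 95)},
]

print(np.unique(mask[20:90, 45:60]))  # only PERSON survives in the overlap
print(boxes)                          # both objects are still reported
```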