I haven’t yet had time to read the full article that Reinoud has given us, but it looks excellent and gives detailed explanations of the differences between the two approaches.
But the short answer is that in the object detection case, you're not trying to reconstruct a labelled version of the input image, right? That is what requires the skip connections in the Semantic Segmentation case: the output there must match the input resolution pixel for pixel, so the fine spatial detail from the early layers has to be carried across to the upsampling path. In Object Detection and Localization the location information is handled differently: it is expressed by the bounding box coordinates that are part of the output, so there is no per-pixel map to reconstruct. That's a fundamentally different approach.

YOLO itself is deep waters. If you want to understand it in more detail, I recommend going through the excellent explanations that ai_curious has put together on these forums. Start with this post and the earlier ones that it links to.
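To make the contrast concrete, here's a minimal sketch of the two output shapes. The sizes are made up for illustration (they're not from any particular model), but the structure follows the usual U-Net segmentation output and the classic YOLO v1 grid encoding:

```python
import numpy as np

# Illustrative sizes only, not from any specific network
H, W = 128, 128      # input image height and width
num_classes = 3      # segmentation classes
S, B, C = 7, 2, 3    # YOLO-style grid size, boxes per cell, classes

# Semantic Segmentation: the output is a full-resolution map with one
# class score per pixel, i.e. a "labelled version" of the input image.
# Recovering fine spatial detail at this resolution is what the U-Net
# skip connections are for.
seg_output = np.zeros((H, W, num_classes))

# Object Detection (YOLO-style): the output is a small grid of cells,
# each predicting B boxes (x, y, w, h, confidence) plus C class scores.
# Location lives in the box coordinates, not in a per-pixel map, so
# nothing needs to be reconstructed at input resolution.
det_output = np.zeros((S, S, B * 5 + C))

print(seg_output.shape)  # (128, 128, 3) -> one prediction per pixel
print(det_output.shape)  # (7, 7, 13)    -> a few numbers per grid cell
```

The segmentation output grows with the image size, while the detection output stays a fixed, much smaller grid, which is one way to see why the two architectures need different plumbing.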