Here is how I differentiate the two approaches:
Sliding windows subdivides the input image. YOLO does not.
Sliding windows runs forward propagation once for each subdivision. YOLO runs forward propagation once.
Sliding windows can only locate an object within the single subdivision that is input to forward propagation at a time. Therefore, objects that are larger than a subdivision, or that straddle subdivision boundaries, cause problems. Since YOLO does not subdivide the input image, it doesn’t matter how big the objects are or exactly where they sit within the image. (Note: the image input to the original YOLO was 448x448 pixels, and YOLO 9000 used 416x416 pixels. The 7x7 or 19x19 we talk about is the number of grid cells, not the number of pixels input to the CNN. Each grid cell covers hundreds to thousands of pixels.)
Sliding windows detects one object per forward propagation. YOLO can detect up to (grid cell count * grid cell count * anchor box count) objects per forward propagation (e.g. 19 * 19 * 5 = 1,805).
That YOLO could detect roughly 1,800 objects significantly faster thanks to the single forward propagation, handle arbitrarily positioned objects, and still deliver competitive accuracy is why it was so important.
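To put rough numbers on the comparison, here is a minimal sketch of the arithmetic. The function names, the 64-pixel window, and the 32-pixel stride are my own illustrative assumptions, not values from either algorithm's paper; only the 448x448 input, the 7x7 and 19x19 grids, and the 5 anchor boxes come from the discussion above.

```python
def sliding_window_passes(image_size, window_size, stride):
    """Forward passes needed when a square window slides over a square image."""
    steps_per_axis = (image_size - window_size) // stride + 1
    return steps_per_axis * steps_per_axis


def yolo_predictions(grid_size, anchor_count):
    """Candidate boxes produced by a single YOLO forward pass."""
    return grid_size * grid_size * anchor_count


def pixels_per_cell(image_size, grid_size):
    """Approximate number of input pixels covered by one grid cell."""
    side = image_size // grid_size
    return side * side


# Hypothetical sliding-window setup: 64x64 window, stride 32, 448x448 image.
passes = sliding_window_passes(448, 64, 32)

# YOLO with a 19x19 grid and 5 anchor boxes: one pass, many candidate boxes.
boxes = yolo_predictions(19, 5)

# Original YOLO: 448x448 input on a 7x7 grid.
cell_pixels = pixels_per_cell(448, 7)

print(passes, boxes, cell_pixels)
```

With these assumed settings, sliding windows needs 169 forward passes and finds at most one object in each, while YOLO produces 1,805 candidate boxes in one pass. A 7x7 grid over a 448x448 image gives cells of 64x64 = 4,096 pixels, which is why a grid cell should not be confused with a pixel.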