Convolutional layers vs sliding window detection

In the video “Convolutional implementation of sliding windows”, why do convolutional layers save computation over sliding window detection? It seems like in both cases we are cropping the bigger image and passing it through a neural network.

As the lecture explains starting at roughly 8:45, the convolutional implementation shares a lot of computation between overlapping windows. Computing the same thing four times is more expensive than computing it once, so you save time by not repeating that computation three more times.

You might imagine a model that accepts a 6x6 input, then draw a simple 8x8 image and a simple 2x2 filter, and count the number of filter applications in each of the two approaches. You should see that the convolutional approach has the lower count, because the sliding window approach recomputes the same filter positions wherever windows overlap. Feel free to show me your drawing and counting if you still have questions.
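If it helps to check your drawing, here is a rough sketch of the counting in Python. It only counts applications of the 2x2 filter in a single first layer, and it assumes a filter stride of 1 and no padding; the function names and the window stride parameter are made up for this illustration and are not from the lecture.

```python
def filter_positions(size, filt):
    """Positions a filt-wide filter can take along one axis (stride 1, no padding)."""
    return size - filt + 1


def sliding_window_count(image, window, filt, window_stride=1):
    """Filter applications when every crop is passed through the layer separately."""
    # Top-left offsets of the crops along one axis.
    offsets = range(0, image - window + 1, window_stride)
    per_window = filter_positions(window, filt) ** 2
    return len(offsets) ** 2 * per_window


def convolutional_count(image, filt):
    """Filter applications when the layer runs once over the whole image."""
    return filter_positions(image, filt) ** 2


print(sliding_window_count(8, 6, 2))                    # 9 crops * 25 = 225
print(sliding_window_count(8, 6, 2, window_stride=2))   # 4 crops * 25 = 100
print(convolutional_count(8, 2))                        # 49
```

With four crops (window stride 2), each crop needs 25 filter applications, for 100 in total, while running the filter once over the full 8x8 image needs only 49. The difference is the work repeated in the regions where the crops overlap.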