In week 3, in the lecture “Convolutional implementation of sliding window”, Andrew explained how to save computational expense by running the network on the entire image instead of on a small window in it.
I don’t understand how it is possible to run the network on a 28x28 image if the input needs to be 14x14. Does it mean that the network architecture needs to be changed?
Can anyone explain the implementation details?
In sliding windows, you cut the original input image into subregions and pass them into a classifier one at a time. The algorithm runs a complete forward pass for each subregion. In a convolutional implementation, you pass the entire image in and instead slide the kernel over it during the convolutions. The convolutional neural net runs only one forward pass. Both approaches can produce the same number of outputs from the same original input. But because all the outputs are produced in parallel instead of in series, and the computation for overlapping subregions is shared rather than repeated, the convolutional approach runs faster. It also has some other advantages regarding object number, location, and size, which you will uncover when you get to the YOLO discussions. And it’s a ‘yes’ to the question about the network architecture needing to be changed. The input shape is the size of the entire training (or test) image instead of the subregion size, the kernel shape is the size of the filter you want to use (often an odd number of pixels, so maybe not exactly 14), and your layers are now 2D convolutions (plus pooling, activations, etc.).
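To make the "same number of outputs" point concrete, here is a rough sketch of the shape arithmetic. The layer sizes (a 5x5 convolution, a 2x2 max pool with stride 2, then a 5x5 convolution standing in for the first dense layer) and the window stride of 2 are my assumptions about the lecture's example, and the function names are made up for illustration.

```python
def conv_out(n, k, stride=1):
    """Output width of a 'valid' convolution or pooling over n pixels."""
    return (n - k) // stride + 1


def convnet_grid(n):
    """Output grid width for an n x n input (assumed layer sizes)."""
    n = conv_out(n, 5)            # 5x5 convolution
    n = conv_out(n, 2, stride=2)  # 2x2 max pooling, stride 2
    n = conv_out(n, 5)            # former dense layer, applied as a 5x5 convolution
    return n                      # any later 1x1 convolutions keep this size


def window_positions(n, window=14, stride=2):
    """Number of window positions per axis for explicit sliding windows."""
    return (n - window) // stride + 1


for size in (14, 16, 28):
    # image size, convolutional output grid width, sliding-window positions per axis
    print(size, convnet_grid(size), window_positions(size))
# prints:
# 14 1 1
# 16 2 2
# 28 8 8
```

Under these assumptions a 28x28 image gives an 8x8 grid of outputs from one forward pass, which matches the 8x8 = 64 window positions you would otherwise process one at a time.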
Thanks for the reply.
I’d like to focus on the network architecture change.
So if we’re changing the network input shape (to be the whole image instead of a small subregion), it means that we need to train the model on the full image size in the first place, right? If so, doesn’t it mean that the sliding window is just an automatic process that happens as part of training the first layers?
Maybe refer to the lecture video for the respective network architectures, in particular the section from about 6 minutes through about 8:40. Sliding windows takes a 14x14 region as input and produces a single floating point value as output. To cover a 16x16 image, you would repeat that complete forward pass 4 times. The convolutional network takes the entire 16x16 image as input and produces 4 floating point values from its single forward pass. You can see that the outputs are equivalent, but how they are produced (the layer shapes and computations) is completely different. I don’t think it is correct or helpful to think of sliding windows as part of a convolutional architecture. They are apples vs oranges, or maybe cats vs non-cats.
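If it helps, the 16x16 case can be checked numerically. This is only a minimal NumPy sketch, not the lecture's code: the channel counts, the random weights, the single output value per window, and the helper names are all assumptions chosen to keep it small. It runs one pass over the whole 16x16 image, then four separate passes over 14x14 crops taken at stride 2, and compares the results.

```python
import numpy as np


def conv2d_valid(x, w, b):
    """'Valid' cross-correlation. x: (H, W, C_in), w: (k, k, C_in, C_out), b: (C_out,)."""
    k = w.shape[0]
    H, W, _ = x.shape
    out = np.empty((H - k + 1, W - k + 1, w.shape[3]))
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            patch = x[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2])) + b
    return out


def max_pool_2x2(x):
    """2x2 max pooling with stride 2 (H and W must be even)."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))


def forward(x, params):
    """Fully convolutional forward pass; works for any input that is large enough."""
    w1, b1, w2, b2, w3, b3 = params
    h = np.maximum(conv2d_valid(x, w1, b1), 0)  # 5x5 conv + ReLU
    h = max_pool_2x2(h)                         # 2x2 max pool
    h = np.maximum(conv2d_valid(h, w2, b2), 0)  # 5x5 conv in place of a dense layer
    return conv2d_valid(h, w3, b3)              # 1x1 conv, one value per position


rng = np.random.default_rng(0)
params = (
    rng.standard_normal((5, 5, 3, 4)), rng.standard_normal(4),
    rng.standard_normal((5, 5, 4, 8)), rng.standard_normal(8),
    rng.standard_normal((1, 1, 8, 1)), rng.standard_normal(1),
)
image = rng.standard_normal((16, 16, 3))

# One forward pass over the whole image.
full = forward(image, params)

# Four forward passes, one per 14x14 window (window stride 2 matches the pooling stride).
matches = []
for i in (0, 2):
    for j in (0, 2):
        crop_out = forward(image[i:i + 14, j:j + 14, :], params)  # shape (1, 1, 1)
        matches.append(np.allclose(crop_out[0, 0], full[i // 2, j // 2]))

print(full.shape)    # (2, 2, 1)
print(all(matches))
```

The sketch reuses one set of randomly initialised weights for both paths, so it only illustrates that the outputs line up; it says nothing about how such a network would be trained.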