So far my understanding is that we can take the exact same ConvNet that we applied independently to each window, apply it once to the whole image, and get the same answer. What was the point of explaining how to convert FC to Conv layers? Is that conversion what makes the convolutional implementation possible, or was it just for visualization purposes?
Let me rephrase my question:
What would happen if we didn’t convert any FC layers to Convolutional layers and applied the original ConvNet to the whole image? Would it work the same way?
Odds are slim that your network will perform well with just an FC layer.
Have you seen this video?
But we are still keeping the convolutional layers that were already there… it’s just a question of whether to keep the last few FC layers as FC or change them to convolutional. In the video ‘Convolutional Implementation of Sliding Windows’, Andrew first explains how to change the last few FC layers to convolutional. What was the point of that?
Seems like you’re referring to the cat detector. As the figure shows, the first few layers are Conv layers. The last 2 hidden layers are dense and the output layer is a single neuron.
There’s no hard and fast rule on the number of conv and dense layers you can introduce into a NN; you can experiment and find what works best for you. The point of the lecture is that placing conv layers before the Dense layers is a good idea.
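For reference, here is a minimal sketch of that shape of network in Keras (my own code, not from the course; the layer sizes are illustrative, loosely following the lecture’s 14x14 example): conv layers first, then two dense hidden layers, then a single-neuron output for the binary cat detector.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(14, 14, 3)),          # fixed-size training window
    layers.Conv2D(16, 5, activation="relu"),  # conv feature extractor
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(400, activation="relu"),     # first dense hidden layer
    layers.Dense(400, activation="relu"),     # second dense hidden layer
    layers.Dense(1, activation="sigmoid"),    # single output neuron
])
model.summary()
```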
I’m referring to the video ‘Convolutional Implementation of Sliding Windows’. Why was it necessary to convert the last FC layers to Convolutional? Would we have gotten the same answer if we did not? https://www.coursera.org/learn/convolutional-neural-networks/lecture/6UnU4/convolutional-implementation-of-sliding-windows
When using a Conv layer at the final stage, you can say for sure which sliding window each of the 4 outputs corresponds to, i.e. the top-left output corresponds to the 1st sliding window.
Should you use a dense layer instead, this can’t be said for sure, since each unit takes inputs from all of the previous layer’s outputs, i.e. all sliding windows. So, in this case, using a conv layer at the end makes sense.
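To make that concrete, here’s a quick check I put together (my own example, not from the lecture): a `Conv2D(400, 5)` head applied to a 6x6x16 volume yields a 2x2x400 grid, and the output at position (0, 0) equals running the same filters on the top-left 5x5x16 window alone.

```python
import numpy as np
import tensorflow as tf

conv_head = tf.keras.layers.Conv2D(400, kernel_size=5)  # "FC as conv" head
x = np.random.rand(1, 6, 6, 16).astype("float32")       # larger input volume

full = conv_head(x)                    # shape (1, 2, 2, 400): 4 window outputs
top_left = conv_head(x[:, :5, :5, :])  # same filters on the top-left window only

print(full.shape)                                     # (1, 2, 2, 400)
print(np.allclose(full[0, 0, 0], top_left[0, 0, 0]))  # True: mapping preserved
```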
Okay, so you’re saying that we lose track of which output corresponds to which input. But does the computation still work? If we kept the FC layers in place, would they adjust to a bigger input size? For example, say a 5x5x16 input originally feeds into a 400-unit FC layer. If we changed the input to 6x6x16, does it still feed into a 400-unit layer (the way a 5x5 filter still would), or would we need to manually increase the FC layer’s unit count to 1600 for the math to work correctly?
You don’t need to change the filter size. Watch https://www.coursera.org/learn/convolutional-neural-networks/lecture/6UnU4/convolutional-implementation-of-sliding-windows from 4:16 to see how the same filter size works on larger input images.
If you were to use an FC layer, your calculation of 1600 nodes is correct.
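If it helps, here’s a small sketch showing both points at once (shapes assumed from the 5x5x16 → 400 example above): the FC weight matrix can be reshaped into an equivalent 5x5 conv kernel, and the same kernel applied to a 6x6x16 input simply produces 2x2x400 = 1600 output values, with no filter change needed.

```python
import numpy as np
import tensorflow as tf

fc = tf.keras.layers.Dense(400)
fc.build((None, 5 * 5 * 16))                 # FC weights: (400, 400)

conv = tf.keras.layers.Conv2D(400, kernel_size=5)
conv.build((None, 5, 5, 16))                 # conv kernel: (5, 5, 16, 400)

# Reshape the FC weight matrix into a 5x5x16x400 conv kernel (same order
# that Flatten uses, so the two layers compute the same dot products).
w, b = fc.get_weights()
conv.set_weights([w.reshape(5, 5, 16, 400), b])

small = np.random.rand(1, 5, 5, 16).astype("float32")
large = np.random.rand(1, 6, 6, 16).astype("float32")

out_small = conv(small)   # (1, 1, 1, 400): identical to the FC layer
out_large = conv(large)   # (1, 2, 2, 400): 4 windows, 1600 values total

fc_out = fc(small.reshape(1, -1))
print(np.allclose(out_small[0, 0, 0], fc_out[0]))  # True: same math
print(out_large.shape)                             # (1, 2, 2, 400)
```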