Week 3: Convolutional Implementation of Sliding Windows (video)

In the Week 3 video "Convolutional Implementation of Sliding Windows", Andrew describes how to convert fully connected layers into convolutional layers.

In later videos, the motivation for this becomes clear: when we apply the network at multiple "slides" (window positions), some of the computations are shared, leading to computational efficiencies.

However, if there were only 1 window, i.e. 1 pass, does the convolutional approach offer any computational benefit over the fully connected (dense) layer approach?

Were you able to find the answer to your question?

Bump, same question here. It seems that the number of multiplications in both cases (conv vs. FC) is the same for the 1x1x400 window size.
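A quick sanity check of that claim, using the layer sizes from the lecture (a 5x5x16 volume feeding 400 units). This is just back-of-the-envelope arithmetic, not the course code:

```python
# For a single window, the multiply count is identical whether we
# flatten 5x5x16 = 400 inputs into a dense layer of 400 units, or
# apply 400 conv filters of size 5x5x16 (producing a 1x1x400 output).
fc_mults = (5 * 5 * 16) * 400    # dense: 400 inputs x 400 units
conv_mults = 400 * (5 * 5 * 16)  # conv: 400 filters, each applied at 1 position
assert fc_mults == conv_mults == 160_000
print(fc_mults)
```

So for one pass there is no saving; the two layers compute exactly the same dot products.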

My take is that when we have a larger input image (and therefore many windows), instead of running a separate pass for each window, the convolutional approach lets us process all the windows in one pass. That's probably where the speed-ups come from.

In "OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks" (Sermanet et al.), that seems to be the case:

"During training, a ConvNet produces only a single spatial output (top). But when applied at test time over a larger image, it produces a spatial output map, e.g. 2x2 (bottom). Since all layers are applied convolutionally, the extra computation required for the larger image is limited to the yellow regions. This diagram omits the feature dimension for simplicity."

Sorry to tag you directly, but do you have any insights on this, Paul? @paulinpaloalto