What does it mean when it is said that the units in a hidden layer cover small regions of the image, and units in later layers cover larger regions? Also, in the lecture, why does each unit in layer 1 have nine image patches?
It all depends on the filter sizes, padding and stride values that you choose. Think about what happens in the very first hidden layer: each output value is the result of applying the given filter at a particular position in the input image, right? So how “big” or “small” that is depends on both the size of the input image and the chosen filter size. Generally speaking, what happens as you go through the successive hidden layers is that the height and width dimensions reduce as the channel dimension expands. The height and width reduction is the result of the filter size, padding and stride values and the pooling layers. So it should be intuitive that, as you go deeper into the network, you are expanding the area of the original input image covered by what each neuron “sees”.
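If it helps to make that concrete, here is a minimal sketch (my own illustration, not from the lecture) of the standard receptive-field calculation for a made-up stack of conv and pooling layers:

```python
# Hypothetical layer stack: (kernel_size, stride) for each conv/pool layer
layer_stack = [(3, 1), (3, 1), (2, 2), (3, 1), (2, 2)]

receptive_field = 1   # how many input pixels one unit "sees" (per dimension)
jump = 1              # distance in input pixels between adjacent units of the current layer
for kernel, stride in layer_stack:
    receptive_field += (kernel - 1) * jump
    jump *= stride
    print(f"after (k={kernel}, s={stride}): each unit covers "
          f"{receptive_field}x{receptive_field} input pixels")
```

With these made-up numbers, the coverage grows from 3x3 after the first conv to 12x12 by the last layer, which is exactly the “expanding area” effect described above.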
Thank you for your explanation. It helps to understand the intuition. I also wanted to ask what it means when we say that as we move to deeper layers we are losing spatial information. I am asking this in relation to the skip connections that were explained for U-Nets. If moving to deeper layers increases the coverage of the image, how can we at the same time be losing spatial information?
I think by “losing spatial information” they are just saying the same thing: since each tile of the image at the later layers reflects a bigger section of the source image, that means you have less ability to discriminate the location of the source data within the image. E.g. you can only say it is in the upper right quadrant (or whatever). But this is just my interpretation FWIW …
The net result of this whole lecture is that the research paper being described here actually gives us a fairly concrete way to understand what it means to say that the later layers of the network are distilling down the “raw” input information from the image into recognitions of various high level features (e.g. the ear of a cat or a particular pattern in the image). We actually get to see the input pattern that triggers the strongest reaction from a given internal neuron.
I like what @paulinpaloalto wrote above. The way I think of it is that for a classification network, at the very end the information of the entire input image has been condensed into a single value: 1 / cat
So the deeper you go in the layers, the more ‘condensing’ has occurred. Or, conversely, the larger the amount of the original input signal each remaining network element represents.
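As a toy illustration (made-up layer sizes, not the network from the course), you can watch that condensing happen in the layer shapes:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical toy binary classifier, just to show the shapes "condensing"
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),                    # the full image
    layers.Conv2D(16, 3, strides=2, activation="relu"),   # -> (31, 31, 16)
    layers.Conv2D(32, 3, strides=2, activation="relu"),   # -> (15, 15, 32)
    layers.Conv2D(64, 3, strides=2, activation="relu"),   # -> (7, 7, 64)
    layers.GlobalAveragePooling2D(),                      # -> (64,)
    layers.Dense(1, activation="sigmoid"),                # -> one number: P(cat)
])
model.summary()  # height/width shrink while channels grow, ending in a single value
```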
Also, be cautious about anthropomorphising those image patches, and saying ‘this layer must be looking at ears’ etc. The neurons only ‘see’ a bunch of floating point numbers that satisfy a constraint or inequality.
Thank you @ai_curious and @paulinpaloalto for your explanations. This really helps to build the intuition on what a deep network is learning at each layer. This is how I am interpreting it. Each unit/neuron in the shallow layers looks at a smaller region of the image / a smaller group of pixels and is activated by certain features identified in that smaller region. As we move into deeper layers, each neuron covers a larger region of the image (a larger number of pixels) and is activated by high-level features. However, since a larger group of pixels contributes to this high-level feature, it is impossible to look back at the original image and point out which smaller group of pixels (or specific pixels) in the image contributes to this high-level feature. In other words, the spatial information (carried by each pixel, or perhaps by the relation between pixels in smaller regions) is lost. In U-Nets, skip connections are used to carry this pixel-level information from the shallow layers and merge it with the high-level features, so that the output volume can be scaled back up to the original input image dimensions.
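To tie that back to code, here is a minimal sketch of the skip-connection idea in a Keras-style toy U-Net block (the layer sizes are made up, not the assignment’s architecture):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical shapes, just for illustration
inputs = tf.keras.Input(shape=(128, 128, 3))
# Encoder block: conv then downsample; the pixel-level detail lives in `skip`
skip = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)   # (128, 128, 32)
down = layers.MaxPooling2D(2)(skip)                                      # (64, 64, 32)
bottleneck = layers.Conv2D(64, 3, padding="same", activation="relu")(down)
# Decoder block: upsample back, then merge the skip connection
up = layers.Conv2DTranspose(32, 3, strides=2, padding="same")(bottleneck)  # (128, 128, 32)
merged = layers.Concatenate()([up, skip])       # high-level + pixel-level features
outputs = layers.Conv2D(1, 1, activation="sigmoid")(merged)              # per-pixel mask
model = tf.keras.Model(inputs, outputs)
```

The Concatenate is the skip connection: the decoder gets both the high-level features from the bottleneck and the pixel-level detail preserved in `skip`, which is what lets the output mask line up with the original image.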