I have a couple of questions about the lecture video U-Net Intuition. Could you please help clarify them?
What is the high-level contextual information of an image? How does it differ from the low-level features of an image?
Here is my understanding of why skip connections are needed. Please correct me if I'm wrong. The final layer gets high-level context information from the previous layers, which means the network learns the cat regions of the image very well, but only at low resolution, so the detected cat region is not very sharp. By using skip connections, we recover the higher-resolution detail of the image. Am I right, sir?
The whole point of U-Net is that the goal is to reconstruct exactly the shapes of the original image, but “painted” with their semantic labels instead of whatever colors they happened to be in the original image. The “downsampling” path just consists of normal “Conv” and pooling layers, so you lose the high resolution spatial information and convert it into “feature recognition” information. That’s what Conv layers do, right? So how are you going to preserve or reconstruct the structures and shapes of the original image in the final output? That is what the skip connections give you.
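To make that concrete, here is a toy 1-D sketch in plain Python. It is not a real U-Net: the max pooling, the nearest-neighbour upsampling, and the names `max_pool_1d`, `upsample_nearest_1d`, and `row` are all assumptions chosen for illustration, and a real network would use learned convolutions and concatenate feature maps along the channel axis. It only shows that downsampling followed by upsampling cannot recover where an object's edges were, while a skip connection carries that detail across.

```python
def max_pool_1d(signal, size=2):
    """Downsample by taking the max over non-overlapping windows."""
    return [max(signal[i:i + size]) for i in range(0, len(signal), size)]


def upsample_nearest_1d(signal, factor=2):
    """Upsample by repeating each value `factor` times."""
    return [value for value in signal for _ in range(factor)]


# A 1-D "image" row (illustrative): the object occupies positions 3..5 only.
row = [0, 0, 0, 1, 1, 1, 0, 0]

# Downsampling path: two pooling steps, 8 -> 4 -> 2 values.
pooled = max_pool_1d(max_pool_1d(row))

# Upsampling path without skips: 2 -> 4 -> 8 values.
restored = upsample_nearest_1d(upsample_nearest_1d(pooled))

# Skip connection: pair each coarse upsampled value with the value saved
# from the original-resolution row (a stand-in for channel concatenation).
skip_concat = list(zip(restored, row))

print("pooled:     ", pooled)
print("restored:   ", restored)
print("with skip:  ", skip_concat)
```

After two pooling steps the object's position is smeared across the whole row, so the upsampled result no longer says where the object starts or ends. Pairing it with the original-resolution row, as a skip connection does, hands the edge locations back to the later layers.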
We’ve talked about that specific point before, haven’t we? As you go through the layers of a ConvNet, the height and width dimensions shrink, unless every layer uses “same” padding with a stride of 1. There are also usually pooling layers, which reduce the height and width further. That means the output of a given neuron deep in the network is influenced by a larger spatial region of the input image.
Let’s suppose we start with 64 x 64 pixel images. Now think about what happens at the very first layer with a 3 x 3 filter: you know to within 3 pixels where whatever influences the output of a given neuron sits in the image, right? It’s in one very specific 3 x 3 patch of the input image. That is precise spatial information: you know where something is in the image. Now think about what happens late in the network, when the height and width are reduced to, say, 2 x 2: all you can say is that what influences the neuron at the 0, 0 position of that layer came from somewhere around the upper left quadrant of the input image. That’s an area of roughly 32 x 32 pixels, right? So that’s much lower resolution spatial information: 3 x 3 is more precise than 32 x 32. This is not a deep or subtle point.
Or think about the case in which your network is just looking for a yes/no answer about whether there is a cat in the image or not. The output of the final layer is then 1 x 1. If it’s “yes”, then all you can say is that there is a cat somewhere in that 64 x 64 image, but you have no idea where. So you have completely lost the spatial information that is contained in the input image. I think ai_curious gave you exactly that example a couple of days ago.
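The arithmetic in the 64 x 64 example can be traced with a short sketch. The layer stack here is hypothetical (five blocks of a 3 x 3 “same” convolution followed by 2 x 2 max pooling), not the network from the lecture, and the helper names `conv_out` and `trace` are made up for illustration. It uses the standard output-size formula, floor((n + 2p - k) / s) + 1, and tracks how far apart neighbouring outputs are in input pixels.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a conv or pooling layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace(input_size, layers):
    """Track size, receptive field, and cell spacing through a layer stack.

    layers: list of (name, kernel, stride, padding) tuples.
    Returns a list of (name, size, receptive_field, spacing) tuples, where
    spacing is the distance in input pixels between neighbouring outputs.
    """
    size, rf, spacing = input_size, 1, 1
    rows = []
    for name, kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
        rf += (kernel - 1) * spacing   # each layer widens the region seen
        spacing *= stride              # strides compound through the stack
        rows.append((name, size, rf, spacing))
    return rows


# Hypothetical stack: five blocks of 3x3 "same" conv + 2x2 max pool.
layers = []
for i in range(1, 6):
    layers.append((f"conv{i}", 3, 1, 1))
    layers.append((f"pool{i}", 2, 2, 0))

rows = trace(64, layers)
for name, size, rf, spacing in rows:
    print(f"{name}: {size} x {size}, receptive field {rf}, spacing {spacing}")
```

The first row reports a 64 x 64 map with a 3-pixel receptive field, and the last a 2 x 2 map whose cells are 32 input pixels apart. The theoretical receptive field at that depth is wider still than 32 pixels, because overlapping filters also see neighbouring regions, which only strengthens the point about lost spatial precision.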