In the picture (from the video "U-Net Architecture Intuition"), the prof first says that for layer 5 (red label), layer 4 provides "the high level, spatial, high level contextual information." Then, about layer 4, he says: "But what is missing is a very detailed, fine grained spatial information, because this set of activations here has lower spatial resolution; the height and width is just lower." At the end of the video, still about this layer 4, he says it has the "lower resolution, but high level, spatial, high level contextual information, as well as the low level."
Q1: In the end, is the spatial information from layer 4 high or low? I am confused here.
Q2: Then the prof talks about the link. My feeling is that, because the spatial information from layer 4 is not good enough (low), we need a link directly from layer 1, which has lower-level features but higher spatial resolution. Is this correct?
We can describe layer 1 as having the lower-level, texture-like, or higher-spatial-resolution information, as it is closer to the original image.
In contrast, we can describe layer 3 as having the higher-level, contextual, or lower-resolution information, as it is deeper in the neural network.
For your Q1 & 2, layer 4 has relatively lower spatial resolution than layer 1, and that’s why we want the skip connection to bring that to layer 5.
Because layer 4 doesn't have enough spatial information (even though the picture has been enlarged again after some layers), we give a link directly from layer 1 to layer 4, since layer 1 has that strong spatial information. Just to double check, is that right?
So, following up on that question: since we need high spatial information, why does the link start not from the very first layer but from the last layer of the downsampling path?
If I just look at the slide, the first thing that comes to my mind is the shape: we want to connect two layers that have the same height and width, because that is a condition for validly stacking two 3D matrices along the channel axis. It seems to me that your choices (blue) might not satisfy that condition.
I understand that your choice would have even more spatial information, but the slide's choice at least has more spatial information (again, we are comparing) than the layer just before the skip connection's destination.
So we want to choose a layer that has more spatial information; both your choice and the slide's choice are fine on that count, but maybe only the slide's choice has the same height and width.
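Just to illustrate what I mean by a valid stacking, here is a toy NumPy check (the shapes below are made up for illustration, not taken from the slide). Concatenating along the channel axis only works when the height and width agree:

```python
import numpy as np

# Two activation volumes with the SAME height and width -> stacking works
a = np.zeros((48, 64, 128))   # e.g. activations arriving at the skip connection's destination
b = np.zeros((48, 64, 64))    # e.g. activations carried over by the skip connection
print(np.concatenate([a, b], axis=-1).shape)   # (48, 64, 192)

# A higher-resolution layer with DIFFERENT height and width -> stacking fails
c = np.zeros((96, 128, 64))
try:
    np.concatenate([a, c], axis=-1)
except ValueError as err:
    print("Cannot stack:", err)
```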
You might want to verify the heights and widths yourself; I can't do it now because I am actually getting ready to sleep. It's midnight in my timezone.
Exactly. It only works to pass the geometric information between the steps where the shapes match. The diagrams in the assignment notebook are really clear and (although it’s been a couple of years since I watched the lectures on this) I’m sure Prof Ng explains this in the lectures as well. The whole point of the U-net architecture is that you have two paths:
The “downsampling” path, which looks like a pretty normal Convolutional Network. It analyzes all the geometric information and distills it into the distinct objects and the identities of those objects. It does that in a number of layers with decreasing geometric information and increasing semantic information as you proceed through the layers. You can see the height and width of the layer outputs decrease as the number of channels increases, as is typical in that kind of ConvNet.
Then you take the final result of the downsampling path and feed it into the “upsampling” path. The job there is to reconstruct the geometry of the original image, but with the color values of the pixels replaced by the semantic identification of the object that contains each pixel. So the upsampling path goes through several steps of “reinflating” the image using transposed convolutions, and those steps mirror the downsampling path in terms of shapes as they grow from the minimal semantic base back to the full image. What the “shortcut” or “skip” connections do is feed the geometric information from the matching steps of the downsampling path into the corresponding levels of the upsampling path. That's how the step-by-step process can succeed at reconstructing the labelled images.
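To make the shape bookkeeping concrete, here is a minimal two-level sketch in Keras. It is not the exact architecture from the lectures or the assignment (the layer sizes and variable names here are made up), but it shows the height and width shrinking on the downsampling path, a transposed convolution growing them back, and a skip connection concatenating activations that have matching height and width:

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(96, 128, 3))

# Downsampling path: height/width shrink, channels grow
d1 = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)  # (96, 128, 32)
p1 = layers.MaxPooling2D()(d1)                                        # (48, 64, 32)
d2 = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)      # (48, 64, 64)
p2 = layers.MaxPooling2D()(d2)                                        # (24, 32, 64)

# Bottleneck: the most semantic, least geometric representation
b = layers.Conv2D(128, 3, padding="same", activation="relu")(p2)      # (24, 32, 128)

# Upsampling path: a transposed convolution "reinflates" the image...
u2 = layers.Conv2DTranspose(64, 3, strides=2, padding="same")(b)      # (48, 64, 64)
# ...and the skip connection stacks in the mirrored downsampling activations,
# which only works because u2 and d2 share the same height and width.
u2 = layers.concatenate([u2, d2])                                     # (48, 64, 128)

model = tf.keras.Model(inputs, u2)
model.summary()  # shapes shrink on the way down, then grow back on the way up
```

If you tried to concatenate u2 with d1 or with the input instead, Keras would raise a shape error, which is exactly why each skip connection pairs up with the mirrored level of the downsampling path.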
I think what you should do now is actually go back and watch the lectures on U-net again. I’m sure Prof Ng covered everything that Raymond and I just explained.