On the left half of the image, we see that when we do a normal convolution, the number of channels increases. However, on the right half of the image, each normal convolution keeps both the spatial dimensions and the number of channels the same. So on the right half, are we doing same-padding convolutions with the number of filters equal to the number of channels in the input representation?
No, on the right side of the U-net architecture, we are in “expansion” mode where we need to get back to the initial image size, but with the labels incorporated. So on the right side, we are using transpose convolutions, not normal convolutions. Transpose convolutions are essentially the “inverse” of normal convolutions and they expand the geometric size of the output.
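For example, here is a minimal Keras sketch (not the assignment code; the input shape and filter count are just made up for illustration) of how a stride-2, “same”-padding transpose convolution doubles the height and width while reducing the channels:

```python
import tensorflow as tf

x = tf.random.normal((1, 16, 16, 256))   # (batch, height, width, channels) - illustrative shape
t_conv = tf.keras.layers.Conv2DTranspose(
    filters=128,        # fewer output channels than the input
    kernel_size=3,
    strides=2,          # stride 2 doubles the geometric size
    padding="same",
)
y = t_conv(x)
print(y.shape)          # (1, 32, 32, 128): spatial size doubled, channels reduced
```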
On the left side (“downsampling”), we are doing normal convolutions, but also passing the output straight across to the “upsampling” phase through the “skip” connections, to make it easier to reassemble the original geometry, but now with the per-pixel labels.
This was covered in the lectures and we’ll get to see the full details when we do the U-net assignment.
@paulinpaloalto - but we are using both normal convolutions and transpose convolutions on the right side. The green arrows represent transpose convolutions and the black arrows represent regular convolutions. We do a T-CONV followed by a couple of regular CONVs.
Ok, it was not at all clear what your question actually was in the initial post. Did you actually look at what happens in the upsample block in the code? You can see that there are the following steps:
- The transpose convolution doubles the geometric size and reduces the number of channels to the desired output number.
- You concatenate the “skip” layer output so you get a lot more channels.
- Then you do 2 normal convolutions with stride = 1 and “same” padding, each with the same number of output filters as the output of step 1.
That’s just one step in the “upsampling” path of course.
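In Keras-style code, those three steps might look roughly like this. This is just a sketch with illustrative layer arguments, not the assignment’s actual upsampling_block implementation:

```python
from tensorflow.keras.layers import Conv2D, Conv2DTranspose, concatenate

def upsampling_block_sketch(expansive_input, contractive_input, n_filters):
    # Step 1: the transpose convolution doubles H and W and sets the channel
    # count to n_filters
    up = Conv2DTranspose(n_filters, 3, strides=2, padding="same")(expansive_input)
    # Step 2: concatenating the "skip" layer output along the channel axis
    # increases the channel count again
    merge = concatenate([up, contractive_input], axis=3)
    # Step 3: two stride-1, same-padding convolutions bring the channel count
    # back down to n_filters
    conv = Conv2D(n_filters, 3, activation="relu", padding="same")(merge)
    conv = Conv2D(n_filters, 3, activation="relu", padding="same")(conv)
    return conv
```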
If the question is why the conv2d layers are necessary on the upsampling path, I don’t really know. At a simplistic level, you need to reduce the channels after concatenating the “skip” output. Take a look at the example test case they have for the upsampling_block function: they need to reduce from 160 channels to 32. So maybe the theory is that the two conv2d layers do that process of capturing all the info in the extra channels and integrating it into the 32 output channels.
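To make that channel arithmetic concrete, here is a hedged walkthrough. The spatial sizes below are made up, and I’m assuming the transpose convolution outputs 32 channels while the skip layer contributes 128, so the concatenation has 32 + 128 = 160 channels before the two conv2d layers cut it back to 32 (only the 160 → 32 reduction comes from the test case itself):

```python
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Conv2DTranspose, concatenate

expansive = tf.keras.Input(shape=(6, 8, 256))    # tensor coming up the expansive path (shape assumed)
skip = tf.keras.Input(shape=(12, 16, 128))       # skip-connection output from the down path (shape assumed)

up = Conv2DTranspose(32, 3, strides=2, padding="same")(expansive)  # -> (12, 16, 32)
merged = concatenate([up, skip], axis=3)                           # -> (12, 16, 160)
out = Conv2D(32, 3, activation="relu", padding="same")(merged)     # -> (12, 16, 32)
out = Conv2D(32, 3, activation="relu", padding="same")(out)        # -> (12, 16, 32)

print(merged.shape)  # (None, 12, 16, 160)
print(out.shape)     # (None, 12, 16, 32)
```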
The rule here is “it either works or it doesn’t”, right? The researchers who published the paper must have done some experimentation and figured out that this combination of operations works well.
This answers my question, which was how many filters are used in these layers.
Yes that makes sense.
Thanks so much @paulinpaloalto for answering this. I get it now.