U-Net | Why does the U-Net architecture use "Transpose Convolution" instead of "1x1 convolution" to decrease the number of channels?

As you can see in the picture below, while illustrating the “U-Net Architecture”, Andrew Ng uses a “Transpose Convolution” and states that he used it to decrease the number of channels. But in the lectures we learned that a “Transpose Convolution” is used for increasing the height and width, whereas a “1x1 convolution” is used for decreasing the number of channels. If that is the case, why does Andrew explicitly use a “Transpose Convolution” in that part of the “U-Net Architecture”? Should it have been a “1x1 convolution” instead of a “Transpose Convolution”?

P.S.: The “Transpose Convolution” is denoted by a green arrow, and as you can see in the subtitle, Andrew says it is used to decrease the number of channels. But we learned that a “1x1 convolution” is normally used for that purpose, not a “Transpose Convolution”.

The point is that what we need on the “upsampling” side of U-Net is not just a decrease in the number of channels: it is explicitly about increasing the height and width dimensions. That is why 1 x 1 convolutions would not do what is needed here and transpose convolutions are what is required.
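To see the difference in terms of output shapes, here is a minimal sketch (assuming TensorFlow/Keras, as used in the course assignments; the sizes and filter counts are made up for illustration):

```python
import tensorflow as tf

x = tf.random.normal((1, 8, 8, 256))  # (batch, height, width, channels)

# A 1x1 convolution only changes the channel count; H and W stay the same.
one_by_one = tf.keras.layers.Conv2D(filters=128, kernel_size=1)(x)
print(one_by_one.shape)  # (1, 8, 8, 128)

# A transpose convolution with stride 2 also doubles the height and width.
trans_conv = tf.keras.layers.Conv2DTranspose(
    filters=128, kernel_size=3, strides=2, padding="same")(x)
print(trans_conv.shape)  # (1, 16, 16, 128)
```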


Yes, I already understand why we need to use a “Transpose Convolution”. But in that particular part, Andrew Ng used a “Transpose Convolution” to decrease the number of channels while keeping the height and width constant, whereas we would normally use a “1x1 convolution” for that purpose. As you said, we use a “Transpose Convolution” to increase the height and width, not to decrease the number of channels. So why did Andrew Ng use a “Transpose Convolution” to decrease the channel count in this case instead of a “1x1 convolution”?

Yes, so we’re done here, right? You’re just restating the point that I was trying to make in my previous reply.


No, the answer is not clear yet. In your first reply, you emphasized that “we need ‘upsampling’, so we have to use a ‘Transpose Convolution’ rather than a ‘1x1 convolution’, because ‘upsampling’ means increasing the height and width”. That is clear to me, but it is not what I asked.

My question is this: as you mentioned, a ‘Transpose Convolution’ is used for upsampling, meaning we use it when we need to increase the height and width. However, in the part that I have circled in red in the image, Andrew Ng used a ‘Transpose Convolution’ to decrease the number of channels, and he said the following while explaining that part:

Andrew Ng: " So we’re going to start to apply transpose convolution layers, which I’m going to note by the green arrow in order to build the dimension of this neural network back up. So with the first transpose convolutional layer or trans conv layer, you’re going to get a set of activations that looks like that. In this example, we did not increase the height and width, but we did decrease the number of channels. "

So, as you can see, Andrew Ng also says that he used the ‘Transpose Convolution’ to decrease the number of channels, not to increase the height and width.

What I want to say is that in previous lectures we learned that a “1x1 convolution” is used to decrease the number of channels, and that a ‘Transpose Convolution’ is used to increase the height and width. However, in the part I am referring to, Andrew Ng used a ‘Transpose Convolution’ to decrease the number of channels, which confused me, because we learned that a ‘1x1 convolution’ is used for that purpose, not a ‘Transpose Convolution’.

So what I am saying is that when we want to decrease the number of channels, we should use a ‘1x1 convolution’, not a ‘Transpose Convolution’. But in this case a ‘Transpose Convolution’ is used for that purpose, and I am asking why.
[image: U-Net architecture slide with the transpose convolution step circled in red]

The point is that transposed convolutions and forward convolutions are not the same thing, right? It’s not just a question of the amount of output data they create: what they are doing is different. So they aren’t just interchangeable based on what you want to do with the dimensions. Note that with a forward convolution, the number of filters you specify is a hyperparameter, so the number of channels can increase or decrease; it just depends on what you are trying to accomplish. And you can use “same” padding and a stride of 1 to preserve the height and width, or you can reduce them.
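For instance, here is a rough illustration (again assuming Keras, with made-up filter counts) showing that both layer types can reduce the channel count while preserving the height and width, even though the operations they perform are different:

```python
import tensorflow as tf

x = tf.random.normal((1, 8, 8, 512))  # (batch, height, width, channels)

# Forward convolution: "same" padding and stride 1 preserve H and W,
# and the number of filters (here 256) sets the output channel count.
fwd = tf.keras.layers.Conv2D(256, kernel_size=3, strides=1, padding="same")(x)
print(fwd.shape)  # (1, 8, 8, 256)

# A transpose convolution with stride 1 and "same" padding also keeps H and W
# while changing the channels -- which may be what that slide is showing --
# but the computation it performs is still different from a forward convolution.
trans = tf.keras.layers.Conv2DTranspose(256, kernel_size=3, strides=1, padding="same")(x)
print(trans.shape)  # (1, 8, 8, 256)
```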

I’m not sure what Prof Ng means at that point in the lecture. If you wait until you get to the U-Net programming exercise and see what actually happens when “the rubber meets the road”, what you’ll see is that each “upsampling” layer starts with a Transpose Convolution that actually does increase the height and width, followed by concatenating the “skip” layer output, followed by two forward convolutions which preserve the height and width and decrease the channels.
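To make that concrete, here is a minimal sketch of such an upsampling block (assuming Keras; the function name, filter counts, and kernel sizes are illustrative and are not the assignment’s exact code):

```python
import tensorflow as tf
from tensorflow.keras import layers

def upsampling_block(x, skip, n_filters=64):
    """One U-Net decoder step: upsample, merge the skip connection, then convolve."""
    # Transpose convolution: doubles the height and width.
    x = layers.Conv2DTranspose(n_filters, kernel_size=3, strides=2, padding="same")(x)
    # Concatenate the skip-connection activations (same H and W) along the channel axis.
    x = layers.concatenate([x, skip], axis=-1)
    # Two forward convolutions: "same" padding and stride 1 keep H and W,
    # while the filter count reduces the channels back to n_filters.
    x = layers.Conv2D(n_filters, kernel_size=3, padding="same", activation="relu")(x)
    x = layers.Conv2D(n_filters, kernel_size=3, padding="same", activation="relu")(x)
    return x
```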
