U-Net Architecture "up-conv" operation

The above is the link to the topic in question.

My question is that in the original U-Net paper, under Section 2, “Network Architecture”, the up-conv operation is defined as follows: “Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels”.

So the up-conv operation is an upsampling followed by a 2x2 Conv2D operation. Why then is it everywhere taken to be, or assumed to be, a Transpose Conv2D operation?
Please help.

It is admirable that you actually consulted the paper. But in that section it’s just a very high level verbal description. We can’t really see how they implemented it. There is a tar file available on their website that apparently includes the implementation, but it sounds like the code might use MATLAB.

Doing a Google search for “upsampling vs transpose convolutions” turns up a number of hits. Here’s an article on the subject from Jason Brownlee’s Machine Learning Mastery website. He points out that these are two ways to accomplish a similar goal.
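For concreteness, here’s a minimal Keras sketch of the two approaches (Keras is just my choice here, not necessarily what the original authors used, and the filter counts are only illustrative):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Option 1: what the paper's wording describes -- upsample, then a 2x2
# convolution that halves the number of feature channels (e.g. 1024 -> 512).
def up_conv_paper(x, filters):
    x = layers.UpSampling2D(size=(2, 2))(x)        # nearest-neighbor, no parameters
    x = layers.Conv2D(filters, kernel_size=2,
                      padding="same", activation="relu")(x)  # learnable 2x2 conv
    return x

# Option 2: what most implementations use -- a single transpose convolution
# that upsamples and halves the channels in one step.
def up_conv_transpose(x, filters):
    return layers.Conv2DTranspose(filters, kernel_size=2, strides=2)(x)

# Both map e.g. a (28, 28, 1024) feature map to a (56, 56, 512) one.
x = tf.random.normal((1, 28, 28, 1024))
print(up_conv_paper(x, 512).shape)       # (1, 56, 56, 512)
print(up_conv_transpose(x, 512).shape)   # (1, 56, 56, 512)
```

Either way the spatial dimensions double and the channel count halves; the difference is in how the upsampled values are computed.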

Note that the paper was published in May 2015 and this course was published in late 2017, and I think the U-Net material actually didn’t get added until they did a major rewrite of the course in April of 2021. So it’s also possible that the general thinking on how to implement U-Net got updated over that six-year period. Six years is a long time in this space: things typically don’t hold still for very long.


I don’t think I remember Prof Ng discussing upsampling in the ConvNets course here. Reading a little bit further in the Jason Brownlee article, we learn that upsampling is the reverse of a pooling layer: it just duplicates rows and columns to increase the size of the tensor. Just like pooling layers, upsampling layers have no trainable parameters, which is not very flexible. A transpose convolution has learnable parameters that modify the generated results, so it is a more flexible way to implement the “up conv” path. Maybe that’s why they do it that way now: it’s actually a more powerful method and thus an improvement over the original architecture because it is easier to train.
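To make that concrete, here’s a small sketch (again just Keras, with layer sizes chosen purely for illustration) showing the parameter counts and the row/column duplication:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Plain upsampling just repeats rows and columns -- nothing to learn.
up = layers.UpSampling2D(size=(2, 2))
up.build(input_shape=(None, 28, 28, 64))
print(sum(w.numpy().size for w in up.trainable_weights))    # 0

# A transpose convolution has a learnable kernel (and bias), so the network
# can learn *how* to upsample instead of just duplicating pixels.
tconv = layers.Conv2DTranspose(32, kernel_size=2, strides=2)
tconv.build(input_shape=(None, 28, 28, 64))
print(sum(w.numpy().size for w in tconv.trainable_weights)) # 2*2*32*64 + 32 = 8224

# The "duplication" behavior of plain upsampling:
x = tf.constant([[1., 2.], [3., 4.]])[None, :, :, None]     # shape (1, 2, 2, 1)
print(layers.UpSampling2D(size=(2, 2))(x)[0, :, :, 0].numpy())
# [[1. 1. 2. 2.]
#  [1. 1. 2. 2.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]
```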

@Syed_Hamza_Mohiuddin @paulinpaloalto I may be saying something dumb here, but in my mind the most important thing is that at every step of the way you have your residual ‘cross’ or skip connections, or whatever you wish to call them.

Without that, I am not sure how this model would work at all.

You’d be losing data in the compression, with no way to recover it again, even in a different format.


Sure, the “skip” connections are absolutely critical in this architecture, but there are literally two separate paths in play at every level: the skip connections (which help reconstruct the geometry of the original image) and then the downsampling and up-conv paths, which do the other critical operation here of actually identifying the objects in the original image and labelling the pixels with the object types. The question here is about how the “up conv” part of it works.
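To show how the two come together, here’s a rough sketch of one level of the expansive path in Keras (filter counts are illustrative, and it uses “same” padding the way the course assignment does, so the skip tensor doesn’t need to be cropped the way the “valid”-padded original paper requires):

```python
from tensorflow.keras import layers

def decoder_block(x, skip, filters):
    """One level of the expansive path: up-conv, concatenate the skip
    connection from the contracting path, then two 3x3 convolutions."""
    x = layers.Conv2DTranspose(filters, kernel_size=2, strides=2)(x)  # the "up conv"
    x = layers.Concatenate()([skip, x])      # skip connection restores spatial detail
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x
```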

@paulinpaloalto I agree; my point was just that you can’t do the “up” path without the matching skip connection sliding in from the “down” path. The questioner did not ask about that part, but it seemed relevant for me to note.

This is not a ‘standard’ feed-forward conv net.


It’s also worth noting that it is completely common and “flavor vanilla” for algorithms to get improved over time from what was in the original paper. As things get deployed and used at scale, people come up with improvements. E.g. how many versions of YOLO have there been now since the original 2015 paper?

Another interesting and nicely clear example of such an improvement is “inverted” dropout. Go back and read the original Srivastava, Hinton, et al. paper that introduced dropout and notice that they hadn’t thought of the “inverted” idea yet, so they had to “downscale” the weights at inference time. The way it is done now, where we multiply by 1/keep_prob at training time to rescale the expected values, is just so much cleaner and simpler. I’ll bet Hinton does a big Homer Simpson “D’oh!” every time he remembers that oversight. :laughing: Here’s a thread that discusses this point in more detail.
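For reference, here’s a minimal NumPy sketch of the “inverted” version as we use it now (the function and variable names are mine, just for illustration):

```python
import numpy as np

def inverted_dropout(a, keep_prob, training=True):
    """Inverted dropout: rescale at training time so no adjustment is
    needed at inference time."""
    if not training:
        return a                                    # inference: use activations as-is
    mask = np.random.rand(*a.shape) < keep_prob     # keep each unit with prob keep_prob
    return a * mask / keep_prob                     # scale up to preserve expected value
```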


@paulinpaloalto I concentrate on other things these days, but I think they are on YOLO V9/10 :grin:


Hi Paul, thanks for always helping us out. Thanks to you too, @Nevermnd, for your insights.
I had already read your reply but was having trouble logging in, and then I forgot about it. I was actually experimenting to make the original paper’s approach work.
In the actual paper, the “up-conv” operation is an upsampling plus a 2x2 Conv2D operation. This introduces learnable parameters through the 2x2 convolution.

I originally asked this question because I had been stuck on this for at least a month. So I started looking for implementations online, and every single one was using Transpose Convolution, so I thought maybe I had misunderstood the paper. I am very much a beginner at reading papers. I then tried Transpose Convolution, but still failed.

Long story short, the problem turned out to be that, due to the “valid” padding and the cropping of encoder outputs when concatenating them with decoder inputs, significant information was being lost.
I solved this by first making the input image smaller, and then mirror-padding it with large padding values. This ensured that the information was preserved even after cropping.
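For anyone curious, here’s a rough sketch of that mirror-padding idea in TensorFlow (the padding amount below just matches the paper’s 572x572 input / 388x388 output sizes; the exact value depends on your own network depth and cropping):

```python
import tensorflow as tf

def mirror_pad(image, pad):
    """Mirror-pad an (H, W, C) image so the border context lost to 'valid'
    convolutions and cropping is reflected copies of real pixels, not zeros."""
    return tf.pad(image, paddings=[[pad, pad], [pad, pad], [0, 0]],
                  mode="REFLECT")

# e.g. pad a 388x388 image up to the 572x572 input size used in the paper
image = tf.random.uniform((388, 388, 3))
padded = mirror_pad(image, 92)
print(padded.shape)   # (572, 572, 3)
```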

Here’s the final solution if anyone’s interested: my-unet-implementation


I have not looked at your implementation, but we built an implementation of U-Net in DLS C4 W3 that uses transpose convolutions and it does not require the innovative application of mirror padding that you describe. Have you compared your solution to what we built in that exercise?