Do we scale down first to decrease the number of parameters in U-Net?

In U-Net, we first scale down the spatial dimensions of the image while increasing the number of channels, then scale back up to the same dimensions as the original image. The number of pixels stays the same, and each pixel is classified into one of a number of classes.
My question here is: why can't we just do direct learning without first scaling down? Is it because we want to decrease the number of learnable parameters?

Sure, it’s fine to consider alternative solutions. So how would you implement the “direct learning” approach? The point is that we need to analyze the image and detect the shapes and patterns in it, right? How do we go about recognizing a car or a curb or a tree or a road sign in the image so that we can “label” it correctly?

The approach that U-Net takes is to build the equivalent of a convolutional “classifier” as the “downscaling” path. Think of that as doing the identification and pattern recognition described above. Note that it’s a different task than the simple “is there a cat in this picture” kind of classifier, in that it needs to be a lot more fine-grained: the downscaling path does not go all the way down to a fully connected layer and then a single neuron for the “yes/no” answer. It only goes down as far as a state complex enough to represent all the features in the image together with their locations and shapes.

Then the “upsampling” path takes that classification information and, also using the “skip” connections, reconstructs something of the same size and shape as the initial input image, but with the pixels labelled by what type of object they are part of.
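
To make that concrete, here is a minimal sketch of a U-Net-style encoder/decoder in PyTorch. To be clear, this is not the exact architecture from the paper: the depth, the channel counts, and the `TinyUNet` name are just illustrative assumptions. But it shows the three pieces described above: a downscaling path that trades spatial resolution for channels, an upsampling path, and the skip connections that carry fine-grained location information across.

```python
# A minimal, illustrative U-Net-style network (not the paper's exact architecture).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions: channels grow as the spatial size shrinks.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.down1 = conv_block(3, 64)      # downscaling ("classifier") path
        self.down2 = conv_block(64, 128)
        self.bottom = conv_block(128, 256)  # note: no fully connected layer here
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = conv_block(256, 128)    # 256 = 128 upsampled + 128 from skip
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = conv_block(128, 64)     # 128 = 64 upsampled + 64 from skip
        self.head = nn.Conv2d(64, num_classes, 1)  # 1x1 conv: per-pixel class scores

    def forward(self, x):
        s1 = self.down1(x)                  # full resolution features
        s2 = self.down2(self.pool(s1))      # half resolution, more channels
        b = self.bottom(self.pool(s2))      # quarter resolution, most channels
        u2 = self.dec2(torch.cat([self.up2(b), s2], dim=1))   # skip connection
        u1 = self.dec1(torch.cat([self.up1(u2), s1], dim=1))  # skip connection
        return self.head(u1)                # (batch, num_classes, H, W)
```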

Yes, I understand the problem now. Thanks a lot.
Just clarifying:
So the downsampling essentially handles the classification, detecting the presence or absence of objects, and the upsampling localizes those objects in the original image at the pixel level.

Yes, that sounds like a good way to say it: the upsampling is mapping the classification data back into the form of the original image as labels on the individual pixels.
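
As a small follow-up to the sketch above (again, `TinyUNet` is just an illustrative stand-in, not the paper’s network): the output carries one score per class per pixel, so taking the argmax over the channel axis turns the classification data back into an image-shaped map of per-pixel labels.

```python
# Hypothetical usage of the TinyUNet sketch from earlier in the thread.
model = TinyUNet(num_classes=10)
image = torch.randn(1, 3, 128, 128)   # a dummy 128x128 RGB "image"
logits = model(image)                 # shape: (1, 10, 128, 128)
labels = logits.argmax(dim=1)         # shape: (1, 128, 128): one class id per pixel
print(labels.shape)                   # torch.Size([1, 128, 128])
```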

Of course, the high-level point here is that we are training the network to perform this complex task using labeled images. All the coefficients for the downsampling path and the upsampling path are learned during training. And the only reason we even know this will work with the given architecture is that the authors of the paper did the work of coming up with it through thought and experimentation, and then demonstrated that it works, at least with the type of data they used for training.
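
For completeness, here is a sketch of what one training step might look like for the `TinyUNet` above, using per-pixel cross-entropy against a label map. The random tensors are stand-ins for a real labeled image, and the optimizer and learning rate are assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

model = TinyUNet(num_classes=10)              # the sketch from earlier in the thread
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

image = torch.randn(1, 3, 128, 128)           # stand-in for a real training image
target = torch.randint(0, 10, (1, 128, 128))  # stand-in for its per-pixel labels

logits = model(image)                         # (1, 10, 128, 128) class scores
# Per-pixel cross-entropy: compares (N, C, H, W) scores to (N, H, W) class ids.
loss = F.cross_entropy(logits, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()                              # coefficients of both paths get updated
```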

It is possible that there are other network architectures that could learn to perform the same task. So if you have some new ideas, try them out, and if they work, write a paper about it! That’s how science works: if you have a theory, you need to find experiments that show whether it is true or false, and then “show your work” so that other people can reproduce your results.