U-Net: Combining Final Output Into a Single Image

At the very end of U-Net, we get an output volume of shape h * w * n_c, where n_c is the number of classes. We combine the channels into a single image to get the segmented output.

How exactly do we combine the channels? Something is said about taking the max.

Can someone please elaborate on this? Do we give priority to certain classes in the training set? Say we want the car to be at the front, so we assign the car class the highest activation value and then just take the maximum activation across all channels for each pixel?

Am I correct in my understanding?

As far as I remember, every pixel of the image belongs to exactly one class; there is no priority list. In the training phase you have images and segmentation maps, and the model learns to classify each pixel into the right class according to the segmentation map.

The argmax here means taking, for each pixel, the class with the maximum probability that comes out of the model; that class is ultimately assigned to that pixel.


Right! The output for each pixel is a softmax distribution across all the possible classes. That is how the prediction is expressed. To translate that to a categorical class, we just need to take the argmax. That’s how it always works in multiclass classifiers, but the new and salient point here is that we’re classifying every single individual pixel in the image, instead of the usual image level classification.
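As a minimal sketch (not from the course materials), here is how that per-pixel argmax could look with NumPy, assuming the model output has shape (h, w, n_c) and the channel vector at each pixel is a softmax distribution over the classes. The sizes and the `probs` array are made up for illustration:

```python
import numpy as np

# Hypothetical sizes and a fake softmax output, just for illustration.
h, w, n_c = 96, 128, 23
probs = np.random.rand(h, w, n_c)
probs = probs / probs.sum(axis=-1, keepdims=True)  # normalize so channels sum to 1 per pixel

# For each pixel, pick the class with the highest probability.
mask = np.argmax(probs, axis=-1)  # shape (h, w), one integer class id per pixel
```

The resulting (h, w) mask is the "single image" you visualize: each pixel value is the index of its predicted class.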

Well, we actually do the usual from_logits = True mode on the loss function here, so the prediction outputs of the model as written are raw logits that we could feed to softmax manually. But softmax is monotonic, so just taking the argmax of the logits gives you the same answer. It's up to you which way you choose to implement that.
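To make that concrete, here is a small sketch (assumed shapes, not the assignment code) showing that the argmax over raw logits matches the argmax over softmax(logits), since softmax preserves the ordering of the values:

```python
import tensorflow as tf

# Hypothetical raw model outputs of shape (h, w, n_c).
logits = tf.random.normal((96, 128, 23))
probs = tf.nn.softmax(logits, axis=-1)  # explicit softmax over the class channels

mask_from_logits = tf.argmax(logits, axis=-1)
mask_from_probs = tf.argmax(probs, axis=-1)

# The two masks agree, because softmax does not change which class is largest.
assert bool(tf.reduce_all(mask_from_logits == mask_from_probs))
```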
