DLS Course 4 - Week 3: U-Net Image Segmentation Assignment

Hi All!

I’m referring to subsection 2.2 of the Image Segmentation assignment, where in the definition of the process_path function there’s a line whose purpose I don’t understand:

mask = tf.math.reduce_max(mask, axis=-1, keepdims=True)

What exactly are we trying to achieve here? I am trying to get this U-Net implementation working with a different dataset (consisting of PNG images with 3 channels instead of the 4 in the assignment), but something goes wrong here.

Any help would be appreciated!



Look at the shape of mask before and after that statement. It turns out that the inputs here are PNG files and they have 4 channels: RGBA. But for the mask values, only one of the channels has a non-zero value. So that reduce_max just gives you one output channel with the actual mask value.
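Here is a small self-contained sketch of what that line does, using a made-up 2x2 "RGBA" mask (not the course data) where only one channel per pixel holds a non-zero class index:

```python
import tensorflow as tf

# Synthetic 2x2 mask with 4 channels; per pixel, at most one channel
# carries the class index and the others are zero.
mask = tf.constant(
    [[[0, 0, 1, 0], [0, 2, 0, 0]],
     [[3, 0, 0, 0], [0, 0, 0, 0]]], dtype=tf.uint8)

print(mask.shape)  # (2, 2, 4)

# Taking the max over the channel axis recovers the class index
# and leaves a single channel.
mask1 = tf.math.reduce_max(mask, axis=-1, keepdims=True)
print(mask1.shape)               # (2, 2, 1)
print(mask1.numpy().squeeze())   # [[1 2]
                                 #  [3 0]]
```

So the statement is just collapsing the 4-channel PNG into the one-channel label map the rest of the notebook expects.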

If your data is in some other format like JPEG, your mileage may vary. Even if your masks are prepared in advance with only one channel holding the mask value, that logic will do no harm, other than wasting some computation and creating some potential confusion. :nerd_face:

Oh, I see, thank you! So in case we’re working with a dataset where the masks are defined over 3 RGB channels, would it make sense to change the previous line to:

msk = tf.image.decode_png(msk, channels = 1)

in order to get one greyscale channel for the masks? Or do I have to adapt the architecture to the masks I’m using?

Yes, it looks like that would work, but you might want to try it and make sure it does the same thing as the reduce_max logic that was shown. Just to make sure their definition of greyscale doesn’t involve any other transformations. You want the mask values to be the “labels” for the types of objects in the image, right? So they should be predefined integer index values.

You can look through how the masks are handled in the rest of the code. It appears that they are handled as one channel images everywhere. Of course the point is you want to train your model to produce those masks as output.

I should say that I have no expertise or experience here beyond what they show us in this notebook. Note that training such a network is going to be computationally expensive.

Doing so seems to do the trick, but the performance is rather poor. I’ll look into what the problem is! Again, thank you!