"Black" mask PNGs in Image segmentation (Image_segmentation_Unet_v2)

Hello,

In this exercise, the train/test image PNGs are located in data/CameraRGB and the image masks are in data/CameraMask.

If I open a PNG from CameraRGB, I can see a sample image from this driving dataset. However, if I open the corresponding PNG from CameraMask e.g. 000026.png, it appears as a “black” image. It appears that there is an extra layer in the mask images, and in the notebook, we look under that layer to read that mask.

Why are the masks created like this?

–Rahul

I’m not sure why they did it that way. The PNG format allows an optional 4th “alpha” channel in addition to RGB, but in the datasets here the alpha value is not useful for anything: all the alpha values are 255. For the masks, only the first color value (the R channel) is useful. You can see that they select channel 0 in the logic in the notebook.
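If you want to convince yourself, here’s a quick sketch (not part of the notebook; it just assumes imageio and the data/CameraMask path used in the assignment) that inspects one of the mask files:

import imageio
import numpy as np

# Inspect one mask file from the assignment's data directory
mask = imageio.imread("./data/CameraMask/000026.png")

print(mask.shape)                 # expect (h, w, 4): RGB plus the alpha channel
print(np.unique(mask[:, :, 3]))   # alpha channel: should be all 255
print(np.unique(mask[:, :, 0]))   # R channel: the per-pixel class labels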

I removed the alpha channel from the masks and saved the mask files in a new “CameraMaskNoAlpha” subdirectory, like this:

import imageio
import matplotlib.pyplot as plt

for mask_img in mask_file:
    mask = imageio.imread(mask_path + mask_img)
    plt.imsave("./data/CameraMaskNoAlpha/" + mask_img, mask[:, :, 0])  # keep only the R channel (the labels)

I updated the mask_list_ds and masks_filenames vars to point to this subdirectory.

I then ran this notebook, but the fit() method gives NaN for the loss and 0.00 for accuracy:

model_history = unet.fit(train_dataset, epochs=EPOCHS)
(TensorSpec(shape=(96, 128, 3), dtype=tf.float32, name=None), TensorSpec(shape=(96, 128, 1), dtype=tf.uint8, name=None))
Epoch 1/40
34/34 [==============================] - 17s 508ms/step - loss: nan - accuracy: 0.0000e+00
Epoch 2/40
34/34 [==============================] - 4s 105ms/step - loss: nan - accuracy: 0.0000e+00
Epoch 3/40
34/34 [==============================] - 4s 104ms/step - loss: nan - accuracy: 0.0000e+00

So maybe the alpha channel is getting used somewhere? Without this change, my notebook runs fine.

Also, what does the tf.math.reduce_max() in the following method do?

def process_path(image_path, mask_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_png(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)

    mask = tf.io.read_file(mask_path)
    mask = tf.image.decode_png(mask, channels=3)
    mask = tf.math.reduce_max(mask, axis=-1, keepdims=True)
    return img, mask

That chunk of code is the reason that you did not need to go to all the trouble to create your own mask files with the alpha stripped out. Look at what the logic does:

First it extracts 3 channels from the masks, which eliminates the alpha channel.

Then it takes the max across the remaining RGB channels to get the mask value. It turns out the label is actually stored in the R channel, so they probably could have written that line more simply, e.g. by just selecting channel 0.
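To make that concrete, here’s a minimal sketch of those two steps applied to a single mask file (the file name is just an example; it assumes the same data layout as the notebook):

import tensorflow as tf

# Decode one mask the way process_path() does
mask_raw = tf.io.read_file("./data/CameraMask/000026.png")
mask_rgb = tf.image.decode_png(mask_raw, channels=3)               # RGBA file decoded as (h, w, 3): alpha is dropped
label = tf.math.reduce_max(mask_rgb, axis=-1, keepdims=True)        # (h, w, 1): per-pixel max of R, G and B

# If the label really lives only in the R channel, selecting channel 0
# directly should give the same (h, w, 1) tensor:
label_r = mask_rgb[:, :, 0:1]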

@rvh @paulinpaloalto

I know this question is already one year old, but it is the most related to what I was looking for. My question applies to the code snippet (process_path()) as cited by @rvh, which is exactly the same in the current notebook.

I am very confused about the creation of the masks; I really wish that was explained a bit more clearly in the notebook.

I have pieced the following together from this post and another one; can you please confirm that this is correct:

  • A fourth ‘alpha’ channel has been added at the end of the RGB channels, i.e., to axis 2 of the image. So we have images of dimension (h, w, 4).
  • When preprocessing, this fourth channel is simply ignored by setting the channels argument in tf.image.decode_png() to 3; i.e., this function only looks at the first three channels of the image. (I actually couldn’t find the function in the TF apidocs, but discovered that it has been moved to tf.io.decode_png().)
  • Finally, the max value along the last (-1) axis of the image (i.e., axis 2) is taken for each pixel - meaning the max of R, G, and B. This becomes the value of that pixel in the mask. But apparently these max values are always stored in the R channel, so the pixel colour will always be some shade of red? (I guess it doesn’t matter whether it’s a shade of R, G, or B - as long as each class has a different colour?)

Is this correct so far?

Then I am also wondering why it is necessary to set keepdims in tf.math.reduce_max() to True. I don’t understand the explanation of that argument in the TF apidocs: “If true, retains reduced dimensions with length 1.”

Thanks!

It’s fine to add on to an old thread, if the topics are relevant, although sometimes people won’t notice unless they are “following” the previous thread. So maybe the best strategy is to try it and if you don’t get any response, then create a new thread and (optionally) link to the old one. Of course tagging someone using @ as you did is also a way to make sure it gets noticed :nerd_face:

Yes, I think that’s all correct.

As I explained on those other threads, the PNG format allows an optional “alpha” channel, which is used to express transparency in some graphics applications. But it isn’t really used here, and the logic provided in the notebook strips it out.

We are using a fairly old version of TF 2.x here, so you have to look for the documentation of the matching version or you may find that things have moved around or even changed in definition. The world of TF mutates pretty quickly. Try printing tf.__version__ to see which documentation version to look for, if you want coherent results.

It’s a good question how the masks end up rendering as nice, simple, close-to-primary colors across the spectrum instead of just shades of red. It looks like the rendering algorithm is somehow smart enough to recognize this as a standard “use case”. If you want to go deeper, more research is needed there. You could try some experiments, e.g. scaling the label values by dividing them by 255. or 22. and see if they end up rendering as grey scale or shades of red.
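Something along these lines would do it (a sketch only, assuming imageio/matplotlib and the mask file mentioned earlier; the path, divisors and figure layout are just examples):

import imageio
import matplotlib.pyplot as plt

labels = imageio.imread("./data/CameraMask/000026.png")[:, :, 0]  # raw integer labels from the R channel

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].imshow(labels)                                        # raw labels, default rendering
axes[1].imshow(labels / 22., cmap="gray", vmin=0., vmax=1.)   # scaled labels, forced grey scale
axes[2].imshow(labels / 255.)                                 # scaled by 255, default rendering
for ax in axes:
    ax.axis("off")
plt.show()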

On the purpose of keepdims there, it’s pretty easy to see the point: try it both ways and watch what happens. If you use the default keepdims = False, you end up with a 2D tensor of shape h x w; with keepdims = True, as in the notebook, you end up with a 3D tensor of shape h x w x 1. The two are logically equivalent from a mathematical point of view, but the algorithms dealing with images expect there to be a “channels” dimension even if it’s trivial, meaning of length 1. I tried removing the keepdims argument and here’s the error I get from that code block:

ValueError: 'images' must have either 3 or 4 dimensions.

The 4D case is when the first dimension is the “samples” dimension, of course, meaning it’s m x h x w x c.
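If you want to see the shapes directly, here’s a small check you can run on a dummy tensor (just an illustration, not the actual mask data):

import tensorflow as tf

mask = tf.zeros((96, 128, 3), dtype=tf.uint8)  # dummy "mask" with the notebook's h x w and 3 channels

with_keepdims = tf.math.reduce_max(mask, axis=-1, keepdims=True)
without_keepdims = tf.math.reduce_max(mask, axis=-1)

print(with_keepdims.shape)     # (96, 128, 1): the trivial channels dimension is retained
print(without_keepdims.shape)  # (96, 128): the channels dimension is gone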

Thank you so much for your extensive reply!

The world of TF mutates pretty quickly. Try printing tf.__version__ to see which documentation version to look for, if you want coherent results.

I keep forgetting the version changes, thanks for the tip :slight_smile:

It looks like the rendering algorithm is somehow smart enough to recognize this as a standard “use case”. If you want to go deeper, more research is needed there. You could try some experiments, e.g. scaling the label values by dividing them by 255. or 22. and see if they end up rendering as grey scale or shades of red.

This is a good idea, I will try it out if I have some time left.

If you use the default keepdims = False, you end up with a 2D tensor of shape h x w; with keepdims = True, as in the notebook, you end up with a 3D tensor of shape h x w x 1.

OK, now I get what keepdims does, and why to retain the dimensions. Makes perfect sense. Then I must say that the explanation in the apidocs is just confusing!

Thanks again for all the clarifications :slight_smile: