Questions about dataset in "Image_segmentation_Unet_v2"

Dear friends and mentors,

I have finished the Image_segmentation_Unet_v2 homework, but I still have some detailed questions about the data. This may take a little of your time. Thank you in advance.

Q1. Why does the data have 4 channels, as in the code below? Both the original img and the mask have 4 channels, but I thought a regular picture only has 3 (RGB).

N = 2
#img and mask size (480,640,4)
img = imageio.imread(image_list[N])
mask = imageio.imread(mask_list[N])
#mask = np.array([max(mask[i, j]) for i in range(mask.shape[0]) for j in range(mask.shape[1])]).reshape(img.shape[0], img.shape[1])

fig, arr = plt.subplots(1, 2, figsize=(14, 10))
arr[0].imshow(img)
arr[0].set_title('Image')
arr[1].imshow(mask[:, :, 0])
arr[1].set_title('Segmentation')

Q2. The original img and mask are both (480,640,4), but this U-Net has 23 classes. Does the mask have 4 classes? My understanding is that the training mask data has 4 classes, yet the final U-Net output can predict 23 classes. I am confused here.

Q3. If a picture is made from 3 RGB channels, what is a mask made from? I think it is the number of channels: maybe the 1st channel segments the road, the 2nd the trees, the 3rd the sky… If you plot imshow(mask[:, :, 0]), the mask looks “reasonable”, but if I plot all channels with imshow(mask), the picture is black. If I plot any channel other than 0, the picture is just dark purple. I don't follow why I get all black (all channels) or all purple (channel 1, 2 or 3).

Q4. This is a silly question, lol. “processed_image_ds” holds the data (img and mask). But which line of code in the homework tells the U-Net that img is the data and mask is the label? It could be the other way around, right…

def process_path(image_path, mask_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_png(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)

    mask = tf.io.read_file(mask_path)
    mask = tf.image.decode_png(mask, channels=3)
    mask = tf.math.reduce_max(mask, axis=-1, keepdims=True)
    return img, mask

def preprocess(image, mask):
    input_image = tf.image.resize(image, (96, 128), method='nearest')
    input_mask = tf.image.resize(mask, (96, 128), method='nearest')

    return input_image, input_mask

image_ds = dataset.map(process_path)
processed_image_ds = image_ds.map(preprocess)

train_dataset = processed_image_ds.cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
model_history = unet.fit(train_dataset, epochs=EPOCHS)
# How does unet know that, in train_dataset, img is the data and mask is the label?

Q5. This is a really critical question, so I hope I can put it here. In the real world, how do you make those masks, i.e. how do you create the training data? Let’s assume humans catch some super alien virus (very new, nobody has seen it before), and I want to use segmentation on x-ray pictures to “draw” the infected area. How would you do it? Ask a human to hand-draw the segmentation one by one, picture by picture? That would take forever…

For Q1, note that the files here are PNG files, not JPEG or TIFF files. One of the options in the PNG format is to include an “alpha” channel, which controls transparency when you render layered images. In these particular images, the alpha channel is always 0 and can be ignored. Here’s another thread that discusses this. If you want to know more about PNG files, the search terms should be obvious.
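To make the channel layout concrete, here is a small NumPy sketch. It uses a synthetic array with the same (480, 640, 4) shape rather than reading the course files, and the names rgba, rgb and alpha are mine, not from the notebook.

```python
import numpy as np

# Hypothetical stand-in for a decoded RGBA PNG: 480 rows, 640 columns, 4 channels.
# The alpha channel (index 3) is left at 0, as described for these files.
rgba = np.zeros((480, 640, 4), dtype=np.uint8)
rgba[..., :3] = 128  # arbitrary R, G, B values, for illustration only

# Keep R, G, B and set the fourth (alpha) channel aside.
rgb = rgba[..., :3]
alpha = rgba[..., 3]

print(rgba.shape)  # (480, 640, 4)
print(rgb.shape)   # (480, 640, 3)
```

In the notebook the same effect comes from decoding with channels=3 inside process_path, so the alpha channel never reaches the model.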

For Q2, that was discussed on the other thread I linked. If you actually look at the contents of the mask files, you’ll find that all the channel values are 0 except the first one (channel 0). So you’ll notice that the logic in process_path (the reduce_max over the channel axis) discards the other values. That one remaining value per pixel can assume any one of the 23 possible label values.
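Here is a rough NumPy sketch of that reduction, under the assumption stated above that only channel 0 is non-zero. The mask is synthetic with random labels, and NUM_CLASSES, mask and labels are illustrative names of my own.

```python
import numpy as np

NUM_CLASSES = 23  # number of classes mentioned in this thread

rng = np.random.default_rng(0)

# Synthetic mask shaped like the real ones: labels live in channel 0,
# every other channel is 0 (assumption taken from the discussion above).
mask = np.zeros((480, 640, 4), dtype=np.uint8)
mask[..., 0] = rng.integers(0, NUM_CLASSES, size=(480, 640))

# Taking the max over the channel axis collapses 4 channels into 1.
# Because the other channels are all 0, the result equals channel 0.
labels = mask.max(axis=-1, keepdims=True)

print(labels.shape)                                  # (480, 640, 1)
print(np.array_equal(labels[..., 0], mask[..., 0]))  # True
print(np.unique(mask[..., 1:]))                      # [0]
```

This may also explain the plots in Q3: a channel that is constant everywhere renders as one flat colour under imshow, and label values of at most 22 out of 255 look almost black when they are interpreted as colour intensities.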

For Q3, in any situation you need to understand the meaning of your data and how it was created. I can’t make any general statement about how people deal with masks in formats other than PNG (the example we see here). My guess is that the reasonable thing to do would be what they do here: put the mask labels on channel 0. If the question is how the mask files are created, that’s covered under Q5 below.

For Q4, it’s the same answer as for Q3: you need to understand your data. There is no general rule or general format. You need to figure it out in your particular situation, or if you are creating the data, then you get to decide. Sure, they could have given you the mask files first, but in all the examples we’ve seen in DLS so far you have X and Y, where X is the input data and Y is the labels.

For Q5, this is a hard and (as you say) critical question, and I don’t know the answer. It doesn’t take much thinking to realize that creating these mask files is a very complex and labor-intensive task. There have been some earlier threads on this, e.g. this one, and here’s one that talks about some research work in this area. Perhaps someone else listening here has already looked into this and can provide more information.

Hi Paul, thanks for your reply.
For Q2, I don't quite follow when you say “That one value can assume any one of the 23 different possible label values.” Are you saying that channel 0 of the mask data (PNG) is enough for the U-Net to learn 23 classes? In other words, the number of channels in the mask file has nothing to do with the number of classes in the U-Net output, right?

For Q4, I was not clear about my question. For example, in a regular CNN the data has two parts: one is the X data, the other is the Y label. We need to tell the CNN which one is which, right? So, in this homework, which line of code tells the U-Net during training that “img is the X data, mask is the Y label”? Thank you!

For Q2, yes, you only need one number to represent the category of a given pixel. Note that this is like using softmax as the output of a multiclass classifier, although here it happens at every pixel. The softmax output at each pixel will be a 23 x 1 vector, so you’ll have to convert the categorical (single value) representation of the label on the pixel to the “one hot” representation in order to compute the loss. But you do that “on the fly”, meaning you always store the data in the “categorical” form because it takes 23x less space, right?
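As a rough illustration of the categorical versus “one hot” point, here is a NumPy sketch on a tiny 2 x 3 “mask”. The logits are random stand-ins for a network output, and all names here are illustrative rather than taken from the notebook.

```python
import numpy as np

NUM_CLASSES = 23

rng = np.random.default_rng(0)

# Categorical form: one integer label per pixel.
labels = rng.integers(0, NUM_CLASSES, size=(2, 3))

# One-hot form: a length-23 vector per pixel, built by indexing an identity matrix.
one_hot = np.eye(NUM_CLASSES, dtype=np.float32)[labels]  # shape (2, 3, 23)

# Hypothetical network output: 23 logits per pixel, turned into probabilities by softmax.
logits = rng.normal(size=(2, 3, NUM_CLASSES))
shifted = logits - logits.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# Cross-entropy per pixel, computed from the one-hot labels...
loss_one_hot = -(one_hot * np.log(probs)).sum(axis=-1)
# ...and directly from the integer labels, without building the one-hot array.
picked = np.take_along_axis(probs, labels[..., None], axis=-1)[..., 0]
loss_sparse = -np.log(picked)

print(one_hot.shape)                            # (2, 3, 23)
print(np.allclose(loss_one_hot, loss_sparse))   # True
```

Both routes give the same loss, which is why the labels can stay in integer form. In Keras, losses such as tf.keras.losses.SparseCategoricalCrossentropy accept integer labels directly, so the one-hot array never has to be stored.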

For Q4, note that it’s the same as always: the X data is used as input to the model and the Y labels come into play only when you get the output of the network (\hat{Y}) and then need to compare that to Y to calculate the loss. Where does that happen in the logic in the notebook?

I just compared it with the previous homework. I think the line dataset = tf.data.Dataset.from_tensor_slices((image_filenames, masks_filenames)) is doing this job: the X data is image_filenames and the Y data is masks_filenames. Is this correct?

Probably, but the question is what is done with the variable called dataset, right? That statement just says that dataset is an “iterator” that gives you two file names on every iteration: the first is an image filename and the second is the corresponding mask filename. So what do you then do with that iterator?

hmm…

OK, this is how I worked it out. I checked the “Convolution_model_Application” homework, which has the code below:

train_dataset = tf.data.Dataset.from_tensor_slices((X_train, Y_train)).batch(64)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test, Y_test)).batch(64)
history = conv_model.fit(train_dataset, epochs=100, validation_data=test_dataset)
# hint: happy_model.fit(X_train, Y_train, epochs=10, batch_size=16)

So, when I see X_train and Y_train, I assume this is the place where you let the NN know which one is the data and which one is the label. I also checked the reference (tf.keras.Model | TensorFlow v2.11.0), which shows the signature of the fit function:

fit(
    x=None,
    y=None,
    batch_size=None,
    epochs=1,
    verbose='auto',
    callbacks=None,
    validation_split=0.0,
    validation_data=None,
    shuffle=True,
    class_weight=None,
    sample_weight=None,
    initial_epoch=0,
    steps_per_epoch=None,
    validation_steps=None,
    validation_batch_size=None,
    validation_freq=1,
    max_queue_size=10,
    workers=1,
    use_multiprocessing=False
)

So, I assume x indicates the data and y is the label. In our case, the line tf.data.Dataset.from_tensor_slices((image_filenames, masks_filenames)) puts x and y together. I'm not sure whether I answered correctly or not.

Yes, I think that’s the right idea. The place where the real action happens is the “fit()” method of the model. That’s where the training happens.

Notice that if you study the code in the Unet notebook, the dataset gets passed through two “map()” methods first before it gets handed to the “fit()” method. Those “map()” calls are just invoking preprocessing functions which preserve the format of the return values: the first one is the image file and the second one is the mask file.

The first function is process_path. Check what that does in addition to reading the files. That’s where the 4 channels that we were discussing before get handled for both the images and the masks.

Then the next function called through “map()” is preprocess, which resizes the images to the expected size.
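To show why the order survives all the way to training, here is a plain-Python analogy (no TensorFlow). The two mapping functions only mimic the shape of the notebook's process_path and preprocess, fit_sketch is a hypothetical stand-in for model.fit, and the file names are made up. The one fact it relies on is that when Keras fit() receives a tf.data dataset, it expects each element to be an (inputs, targets) tuple.

```python
def process_path(image_path, mask_path):
    # Stand-in for reading and decoding the two files; the order is preserved.
    return f"pixels({image_path})", f"labels({mask_path})"


def preprocess(image, mask):
    # Stand-in for resizing; again returns (image, mask) in the same order.
    return f"resized({image})", f"resized({mask})"


def fit_sketch(dataset):
    """Hypothetical stand-in for model.fit: the first item is x, the second is y."""
    steps = []
    for x, y in dataset:
        steps.append({"input": x, "target": y})
    return steps


# Made-up file names, paired as (image file, mask file).
filenames = [("img0.png", "mask0.png"), ("img1.png", "mask1.png")]

image_ds = [process_path(i, m) for i, m in filenames]        # like dataset.map(process_path)
processed_image_ds = [preprocess(i, m) for i, m in image_ds]  # like image_ds.map(preprocess)

steps = fit_sketch(processed_image_ds)
print(steps[0])
```

So nothing in the code names the image as “X” explicitly. The position in the tuple does that job, which is why swapping the two file lists in from_tensor_slices would swap data and labels.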

Thanks again for the details.