Note that the files here are PNG files, not JPEG or TIFF files. One of the options in the PNG format is to include an “alpha” channel, which controls transparency when you render layered images. In these particular images, the alpha channel is always 0 and can be ignored. Here’s another thread that discusses this. If you want to know more about PNG files, searching for “PNG alpha channel” is a good place to start.
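If you want to check for yourself whether a given PNG declares an alpha channel, here is a minimal sketch using only the Python standard library. It reads the color type from the IHDR header chunk (types 4 and 6 carry alpha per the PNG spec) and does not look at palette transparency. The names `png_has_alpha` and `make_tiny_png` are mine, not from the assignment, and the tiny in-memory PNG is just a stand-in so the example runs without any files.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Color types that carry an alpha channel: 4 = grayscale+alpha, 6 = RGBA.
COLOR_TYPES_WITH_ALPHA = {4, 6}


def png_has_alpha(data: bytes) -> bool:
    """Return True if the PNG header (IHDR) declares an alpha channel."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    # IHDR is the first chunk: 4-byte length, 4-byte type, then its data.
    if data[12:16] != b"IHDR":
        raise ValueError("missing IHDR chunk")
    _width, _height, _bit_depth, color_type = struct.unpack(">IIBB", data[16:26])
    return color_type in COLOR_TYPES_WITH_ALPHA


def _chunk(kind: bytes, payload: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, payload, CRC."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def make_tiny_png(color_type: int) -> bytes:
    """Build a 1x1, 8-bit PNG in memory (hypothetical helper for this demo)."""
    channels = {0: 1, 2: 3, 4: 2, 6: 4}[color_type]
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, color_type, 0, 0, 0)
    raw = b"\x00" + bytes(channels)  # filter byte + one all-zero pixel
    return (
        PNG_SIGNATURE
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(raw))
        + _chunk(b"IEND", b"")
    )


rgba_has_alpha = png_has_alpha(make_tiny_png(6))  # RGBA
rgb_has_alpha = png_has_alpha(make_tiny_png(2))   # plain RGB
```

On a real file you would pass in the bytes from `open(path, "rb").read()` instead of the synthetic PNG.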
That was discussed on that other thread I linked. If you actually look at the contents of the mask files, you’ll find that every channel value is 0 except the first one (channel 0). That is why the logic keeps only that one value and discards the other 3. That one value can be any of the 23 possible label values.
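To make that concrete, here is a rough NumPy sketch of the idea, not the assignment's actual code. It assumes the mask has already been decoded into an array of shape (height, width, 4), and it uses a small synthetic array in place of a real mask file. The variable names are made up for illustration.

```python
import numpy as np

NUM_CLASSES = 23  # number of possible label values mentioned above

# Synthetic stand-in for a decoded 4-channel mask of shape (H, W, 4):
# labels live on channel 0, every other channel is 0.
rng = np.random.default_rng(0)
mask_rgba = np.zeros((4, 6, 4), dtype=np.uint8)
mask_rgba[..., 0] = rng.integers(0, NUM_CLASSES, size=(4, 6))

# Keep only channel 0, preserving a trailing axis so the shape is (H, W, 1).
labels = mask_rgba[..., :1]

# Sanity checks you could run on your own data before discarding channels.
other_channels_empty = not mask_rgba[..., 1:].any()
labels_in_range = int(labels.max()) < NUM_CLASSES
```

Because the other channels are all 0, taking the maximum across channels gives the same result as slicing out channel 0, so either approach recovers the labels for data laid out this way.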
In any situation, you need to understand the meaning of your data and how it was created. I can’t make any general statement about how people deal with masks in formats other than PNG (the example we see here). My guess is that the reasonable thing to do would be what they do here: put the mask labels on channel 0. If the question is how the mask files are created, that’s covered under Q5 below.
It’s the same answer as to Q3: you need to understand your data. There is no general rule or general format. You need to figure it out in your particular situation. Or if you are creating the data, then you get to decide. Sure, they could have given you the mask files first, but in all the examples we’ve seen in DLS so far, you have X and Y, where X is the input data and Y is the labels.
This is a hard and (as you say) critical question, and I don’t know the answer. It doesn’t take much thinking to realize that creating these mask files is a very complex and labor-intensive task. There have been some earlier threads on this, e.g. this one and here’s one that talks about some research work in this area. Perhaps someone else listening here has already looked into this and can provide more information.