The mask consists of a single value, the predicted class, for each pixel location. It is extracted from the forward propagation output by selecting, for each pixel, only the class with the highest predicted probability (using argmax() along the class axis).
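In case it helps to see that step concretely, here is a minimal sketch in NumPy. The function name `create_mask`, the (height, width, n_classes) layout, and the random scores are assumptions made for illustration, not necessarily what the model code looks like.

```python
import numpy as np

def create_mask(pred):
    """Collapse per-class scores into one class id per pixel.

    pred: array of shape (height, width, n_classes) (assumed layout).
    Returns an integer array of shape (height, width, 1).
    """
    # argmax over the last axis picks the index of the highest score per pixel
    mask = np.argmax(pred, axis=-1)
    # add back a trailing channel axis so the mask has one channel
    return mask[..., np.newaxis]

# Hypothetical forward-propagation output: a 4x6 image with 3 classes
rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 6, 3))

mask = create_mask(pred)
print(mask.shape)  # (4, 6, 1)
```

The trailing axis is kept only so the mask has the same rank as an image; dropping it gives a plain (height, width) array.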
You’re basically creating an array with the height and width of the input image that contains just the encoding of the predicted class for each pixel.
Right! The masks are images with only one meaningful value per pixel, and it is in channel 0. One other slightly subtle thing to point out is that we are dealing with PNG files here, which are not necessarily plain 3-channel RGB images. In PNG, one of the options is to express images with 4 channels per pixel: R, G, B and A (Alpha). Alpha is used in some graphics applications, but all the images here have the A value as 255. It looks like imshow is sophisticated enough to render the 4-channel camera images without “slicing” them to eliminate that 4th channel. But with the mask images, “not so much” …
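To make the “slicing” concrete, here is a small sketch with synthetic arrays standing in for the loaded PNGs. The array names, the sizes, and the class id 7 are made up for illustration, and it assumes the mask is loaded as a 4-channel array with the class id in channel 0.

```python
import numpy as np

height, width = 4, 6

# Synthetic stand-in for a camera image loaded from a 4-channel PNG (R, G, B, A)
camera_rgba = np.zeros((height, width, 4), dtype=np.uint8)
camera_rgba[..., :3] = 128   # arbitrary grey for R, G, B
camera_rgba[..., 3] = 255    # alpha is 255 everywhere, as in the images here

# Synthetic stand-in for a mask PNG: class id in channel 0 (assumed layout)
mask_rgba = np.zeros((height, width, 4), dtype=np.uint8)
mask_rgba[..., 0] = 7        # hypothetical class id
mask_rgba[..., 3] = 255

# Drop the alpha channel of the camera image
camera_rgb = camera_rgba[..., :3]   # shape (height, width, 3)

# Keep only channel 0 of the mask, giving one value per pixel
mask = mask_rgba[..., 0]            # shape (height, width)

print(camera_rgb.shape, mask.shape)
```

A 2-D array like `mask` is what a single-channel display expects, so it can be passed to imshow directly.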