CNN: Complex loss and last activation functions

Simple case: cats and dogs example
Input shape (150,150,3) → CONV layers… → Flatten() → Dense(512, activation='relu') → Dense(1, activation='sigmoid')
Loss = binary_crossentropy
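As a side note, the binary cross-entropy used for this single sigmoid output can be sketched in plain numpy (a minimal illustration of the formula, not the Keras implementation):

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; eps avoids log(0)."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# A confident correct prediction (label 1, predicted 0.9) gives a small loss:
loss = binary_crossentropy(np.array([1.0]), np.array([0.9]))
```

The loss grows as the predicted probability moves away from the true label, which is what drives the cat/dog classifier during training.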

Complex case and question
Input shape (150,150,75)
Output shape (150,150, 500)

• The output shape has the same height and width as the input shape
• The model will be trained with this output shape: a grid of height 150 and width 150. The number of channels is 500, one per potential species that could be observed within each cell or pixel.
• The final objective for this neural network would be a 150x150 matrix with 500 channels, each channel giving the probability of one species being present within each pixel or cell.

To summarize:
Input shape (150,150,75) → CONV layers → … ? … → Output shape (150, 150, 500)

Question

• What kind of loss function could be used?
• Should Flatten be used?
• What could be the last activation function?

Thank you for any idea or suggestion

If you did use flatten in the model, please click my name and message your notebook as an attachment.

Hello @balaji.ambresh,

I did not use flatten yet.

I am just trying to define the architecture I should use (loss / activation function) to be able to output this shape (150,150,500).

The advantage of having this output shape is, for example, that if I want to get all cell (or pixel) probabilities for species 1, I can just extract the first channel of the 150x150 output grid.

Maybe a way would be to flatten the height and width (150 x 150) in the last layer

• That would give an output of shape (22500, 500) instead of (150, 150, 500)
• But I still can’t figure out what loss and last activation function I should use, given that more than 1 of the 500 species could be present in the same cell or pixel
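One common candidate for this kind of multi-label setup is an independent sigmoid per channel with element-wise binary cross-entropy, since several channels can then be 1 for the same pixel. A numpy sketch of how the shapes work out (the targets here are random placeholders, not real species data):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 150, 150, 500                      # grid height/width, species channels

# Multi-hot ground truth: several species may be present in the same pixel
y_true = (rng.random((H, W, C)) < 0.01).astype(float)

# An independent sigmoid per channel turns logits into presence probabilities
logits = rng.normal(size=(H, W, C))
probs = 1.0 / (1.0 + np.exp(-logits))

# Element-wise binary cross-entropy, averaged over all pixels and channels
eps = 1e-7
p = np.clip(probs, eps, 1 - eps)
bce = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```

Unlike a softmax over the 500 channels, the per-channel sigmoid does not force the probabilities in a pixel to sum to 1, which matches the "more than one species per cell" requirement.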

This topic, semantic segmentation, is covered in week 3 of course 4 of the Deep Learning Specialization. I can’t find the Coursera video on YouTube. Please search for it or, even better, take the course. The network is called U-Net.


Thanks a lot for this excellent input @balaji.ambresh!

As a matter of fact, I took the course but didn’t make the connection, because the contents of my input and output channels are different (150x150x75 versus 150x150x500):

• Input channels are environmental variables (75 of them: water, forest, etc.)
• Output channels are species: as you suggested, a true mask of depth 500, one channel per species, with some species present in a pixel and some not
• And as you suggested, the objective would be to get the predicted masks with U-Net

Would it then be all right to simply add a few layers at the end of the U-Net to get the correct output shape, or would I be distorting the intent of the U-Net too much?

You’re welcome.

Since you’ve taken the course, one thing to notice is that the U-Net implementation in the paper is different from the one in the assignment. In the assignment, the input shape is `(96, 128, 3)` and the output shape is `(96, 128, n_classes)`.

Given the match in the first 2 dimensions, I’m in favor of creating a custom U-Net (or reshaping the input to match the U-Net) rather than adding a few layers at the end to get back to the original width and height.
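To give a rough idea of what the head of such a custom U-Net looks like: the final 1x1 convolution is just a per-pixel linear map from the decoder’s feature channels to the 500 species channels, followed by a sigmoid for multi-label presence. A numpy sketch (the 64 feature channels are a made-up example, and the weights are random):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical decoder features at full resolution (64 channels is an assumption)
features = rng.normal(size=(150, 150, 64))

# A 1x1 convolution is a per-pixel linear map over the channel axis
W = rng.normal(size=(64, 500)) * 0.01
b = np.zeros(500)
logits = features @ W + b                      # shape (150, 150, 500)

# Sigmoid gives independent per-species presence probabilities in each pixel
probs = 1.0 / (1.0 + np.exp(-logits))
```

The rest of the U-Net (encoder, decoder, skip connections) stays unchanged; only this head and the number of input channels differ from the assignment version.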

Why don’t you try both and reply to this thread with the results?

Cheers.


Awesome, @balaji.ambresh,

Sure, no problem, I will try both and reply to the thread. Give me a few months, as I have ongoing projects I need to finish first.