Hey @Manu,
I guess the first point requires no further discussion. You have laid out its pros and cons very well, and I agree with all of them.
Now coming to the second point. I haven’t seen this approach used in any research so far because, as far as I understand, we can’t draw a direct relationship between the weights and the classes. The weights relate to the function that the neural network learns in order to predict one of the many classes. Even if (and that’s a big IF) we somehow learn to modify the weights, and hence the function learnt by the neural network, so that the network predicts only 2 classes instead of 3, it will predict incorrectly on the cells that would originally have been classified as ‘N/A’, because the dataset trained the model to classify a cell into 3 classes, not 2.
If you are wondering why it’s a big IF: neural networks are, in general, black box models. Understanding the function learnt by a neural network is not an easy task, and modifying it is an even harder one. There has been a ton of research in the past decade enhancing our understanding of neural networks, but I am uncertain whether that will be enough for this task.
Additionally, classifying a cell as ‘N/A’ doesn’t make much sense to me, because we simply have missing labels for these square cells. These cells don’t differ in their distribution of X from the cells that have labels. It’s just that their labels are missing. So, how do you think a model could possibly differentiate these cells? For example, consider 2 examples with the same features, where one has a label and the other doesn’t. It’s a completely valid case, and there is no way a model can differentiate between these 2 examples. So, even if we are somehow able to implement this approach (once again, a big IF), we will circle back to
Now coming to the third approach, this seems to be an interesting one. I assume you are thinking of some sort of masking approach. I thought about this too, and it seemed pretty good until something else came to my mind. Let’s say we apply the masking in the input layer, i.e., to X before it is fed to any layer. Now, what’s stopping the neural network from making some sense out of these masking values and using them to learn a function that classifies each cell as presence or absence of a species? We wanted the neural network to exploit the spatial information in the first place, and I guess it might make a lot more sense of the mask than we wanted it to. At inference or production time, there will be no mask in any of the examples. So, will the performance be retained?
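To make the train/inference mismatch concrete, here is a minimal sketch of one possible reading of input-level masking. Everything in it is an assumption for illustration only: the `NA` marker, the `FILL` value, the `mask_inputs` helper and the toy features are hypothetical and not taken from your setup.

```python
NA = -1      # hypothetical marker for a cell whose label is missing
FILL = 0.0   # hypothetical value written into the features of masked cells

def mask_inputs(features, labels):
    """Replace the features of every cell labelled NA with FILL."""
    return [
        [FILL] * len(x) if y == NA else list(x)
        for x, y in zip(features, labels)
    ]

# Toy data: 3 cells with 2 features each; the middle cell has no label.
features = [[0.3, 1.2], [0.7, 0.1], [0.5, 0.9]]
train_labels = [1, NA, 0]

# Training time: the unlabeled cell is overwritten with FILL.
train_inputs = mask_inputs(features, train_labels)

# Inference time: there are no labels, so nothing gets masked.
inference_inputs = [list(x) for x in features]

# Number of cells that look different to the network at train vs. inference time.
n_changed = sum(a != b for a, b in zip(train_inputs, inference_inputs))
```

The network sees `FILL` patterns during training that never appear at inference time, which is exactly the gap I am worried about.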
Another possible place to apply masking is while doing the cost computation. But if we do this, it is as good as adding 0 to the cost for cells having ‘N/A’ as their labels, which is another way of saying that for these cells, we have the perfect predictions, which is definitely wrong. So, how do you think we can employ this approach?
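Just so we are talking about the same thing, here is a minimal sketch of what I mean by masking in the cost computation, assuming a binary presence/absence cross-entropy. The `NA` marker, the `masked_cross_entropy` helper and the toy numbers are all hypothetical.

```python
import math

NA = -1  # hypothetical marker for a cell whose label is missing

def masked_cross_entropy(probs, labels):
    """Mean binary cross-entropy over the labeled cells only.

    probs:  predicted probability of 'presence' per cell, each in (0, 1)
    labels: 1 (presence), 0 (absence) or NA (missing) per cell
    """
    total, n_labeled = 0.0, 0
    for p, y in zip(probs, labels):
        if y == NA:
            continue  # masked cell: adds exactly 0 to the cost
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
        n_labeled += 1
    # Averaging over labeled cells (rather than all cells) is an assumption here.
    return total / n_labeled if n_labeled else 0.0

labels = [1, 0, NA, NA]
loss = masked_cross_entropy([0.9, 0.2, 0.6, 0.5], labels)

# Same predictions on the labeled cells, very different ones on the N/A cells.
loss_other = masked_cross_entropy([0.9, 0.2, 0.01, 0.99], labels)
```

Here `loss` and `loss_other` come out equal, i.e., whatever the model predicts on the ‘N/A’ cells has no effect on the cost.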
In conclusion, the second approach seems to be a dead end to me. The first approach you have defined pretty well, as I just said. The third approach could possibly be used, but I am pretty uncertain about it as well.
Let me tag some other mentors, and they will surely be able to correct our perspectives if they are going wrong somewhere, or perhaps provide some new perspectives.
@TMosh @paulinpaloalto @rmwkwok @anon57530071 Guys, can you please look into this query and share your opinions? Thanks in advance.
Cheers!