How to encode unknown (N/A) values in a CNN response variable?

Let's imagine that we are observing foxes and recording presences (1) and absences (0) in the square cells (1 km × 1 km) of a raster, which is essentially a matrix of cells covering the whole territory.

The neural network is a CNN:

  • The response variable, Y, is thus a matrix containing absences (0) and presences (1).
  • The predictor variables, X, are for example altitude, vegetation…

Given that we could conduct observations in only a few spots, we can attribute our observation values (0 or 1) to only a few cells.

How should the unknown (N/A) values in Y be encoded? All the other cells that we didn't have a chance to visit may in fact contain absences or presences of foxes.

→ Would assigning a neutral value of 0.5 to the unknown (N/A) cells in Y be appropriate? We would end up with 0 for absence, 1 for presence and 0.5 for unknown. The risk is that the network would learn from the unknown cells as well and thus introduce bias.

→ Another solution would be for Y to be a superposition of 2 matrices: one matrix, foxesAbsence, with value 1 where absence was observed, and another, foxesPresence, with value 1 where presence was observed. In this case, the unknown would be left implicit.
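To make that second option concrete, here is a minimal NumPy sketch (the 2×2 toy raster and variable names are hypothetical) of such a two-matrix encoding, where unvisited cells simply stay zero in both matrices:

```python
import numpy as np

# Toy 2x2 observation raster: 1 = presence, 0 = absence, NaN = unvisited
obs = np.array([[1.0, 0.0],
                [np.nan, 1.0]])

# Superposition of two matrices: confirmed absences and confirmed presences
foxesAbsence = (obs == 0).astype(float)   # 1 where absence was observed
foxesPresence = (obs == 1).astype(float)  # 1 where presence was observed

Y = np.stack([foxesAbsence, foxesPresence], axis=-1)  # shape (2, 2, 2)

# Unvisited cells are zero in both channels, so "unknown" stays implicit
print(Y[1, 0])  # [0. 0.] -> the NaN cell encodes as neither absence nor presence
```

Note that `NaN == 0` and `NaN == 1` are both `False` in NumPy, which is what keeps the unvisited cell out of both matrices.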

It should be mentioned that subsetting Y by leaving aside the unknown cells is not applicable, for two reasons: 1) the network used is a CNN; 2) the response variable is a superposition of matrices for other species as well. Therefore, a way to encode N/A (unknown) in the different species matrices constituting Y is needed.

Any thoughts or suggestions?

Hey @Manu,
That’s an interesting question. Before presenting my opinion, do help me to understand the question clearly please. When you say this:

Are these defined for the entire raster only once, or are they defined individually for each square cell (1 km × 1 km)? I am assuming the latter, because otherwise how would we differentiate the square cells from one another? Moreover, how are these features defined for these square cells in the form of an image? Like one image for the altitude of the raster, one image for the vegetation of the raster, and so on?

Additionally, do you want the network to predict “Unknown or N/A”, or do you want it to predict the presence or absence of foxes for these unknown square cells? Again, I am assuming the latter, but please clarify.

Lastly, please elaborate upon the following statement:

Regards,
Elemento

Hi @Elemento, thanks a lot for your interest and questions:

Single predictor X

  • Each variable, for example vegetation, is one raster layer with the shape (nW, nH, nC) → for example (100, 100, 1)
  • In this case, we have 100 × 100 cells, each of which could be 1 km²
  • The value of vegetation (0 or 1) is then assigned to each cell: if there is no vegetation, the cell gets the value 0; if vegetation is present within the cell, it gets the value 1

Multiple predictors X

  • There could be 15 other variables, such as water, altitude…
  • Each variable has the shape (100, 100, 1), with a value within each cell. The values can be binary, as in the case of vegetation, or continuous, as in the case of altitude.
  • All raster layers are then stacked together
  • In the end, for 15 variables, you get X.shape = (100,100,15)
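As a sketch of the stacking step described above (NumPy assumed; the layer contents here are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two example layers, each of shape (100, 100, 1)
vegetation = rng.integers(0, 2, size=(100, 100, 1)).astype(float)  # binary
altitude = rng.uniform(0.0, 3000.0, size=(100, 100, 1))            # continuous

# 13 further placeholder layers (water, slope, ...) to reach 15 variables
others = [rng.random((100, 100, 1)) for _ in range(13)]

# Stack all raster layers along the channel axis
X = np.concatenate([vegetation, altitude] + others, axis=-1)
print(X.shape)  # (100, 100, 15)
```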

Single response variable y

  • For example, fox presence (1) or absence (0) within each cell
  • For a single species, Y.shape would then be (100, 100, 1)
  • The goal is for the network to predict the presence or absence of foxes within each unknown square cell, i.e. wherever the observer could not go to see whether foxes were there or not.

Multiple response variables Y

  • Other species could be added, for example deer absence-presence within each cell
  • For 2 species, Y.shape would then be: (100,100,2)

Model1: prediction of a single species

  • To lessen complexity, in a first phase, a network predicting the presence-absence of a single species will be built
  • In this case, X.shape = (100,100,15) and Y.shape = (100,100,1)
  • Based on the X predictors, the model should output the probability of presence or absence of foxes within each cell with unknown status, in other words, within each cell where the observer could not go to conduct observations.
  • A CNN model is used, as it will process X (100,100,15) with a depth of 15 channels. It’s like processing a picture, but instead of having 3 colour channels, we now have 15.
  • In the case of single-species prediction, the CNN output shape will be (100,100,1)

Model2: prediction of multiple species

  • To lessen complexity, model1 is elaborated in a first phase
  • In a second phase, the objective is to have the model output predictions for multiple species
  • In this example with 2 species (foxes and deer), this means X.shape = (100,100,15) and Y.shape = (100,100,2), the output this time having 2 channels.

Any thoughts about how to handle the unknown (N/A) cells in the response variable ?

Regards,
Manu

Hey @Manu,
It indeed is an amazing question. But let’s take it step by step.

From your above description, am I safe to conclude that the features in X, which has a shape of possibly (100, 100, 15), aren’t related to pixels in any possible way? If my conclusion is correct, have you given any thought to posing it as a classification problem, but instead of using CNNs, using a classification model, for instance XGBoost, a decision tree, logistic regression, etc.?

Because, if we have X with dimensions (100, 100, 15), we can simply unroll it so that it has a shape of (10000, 15), and now it’s nothing but a tabular dataset, and we can simply eliminate the rows having N/A values.
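A small sketch of that unrolling step (NumPy assumed; the few labelled cells are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 100, 15))          # 15 predictor layers

# Labels: NaN everywhere except a few visited cells
Y = np.full((100, 100), np.nan)
Y[25, 30], Y[72, 80], Y[10, 5] = 1.0, 0.0, 1.0

X_tab = X.reshape(-1, 15)               # (10000, 15) tabular features
y_tab = Y.reshape(-1)                   # (10000,) labels

known = ~np.isnan(y_tab)                # drop rows with N/A labels
X_known, y_known = X_tab[known], y_tab[known]
print(X_known.shape, y_known.shape)     # (3, 15) (3,)
```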

I am suggesting this because if your data doesn’t share any fundamental aspects of a typical image, like horizontal edges, diagonal edges, circles, etc., CNNs won’t be of much help to you, don’t you think?

Do let me know what you think about this, and then we will discuss further.

Cheers!

There are methods for dealing with missing features (for example, if specific ‘x’ values from a given example are not available).

But with supervised learning, you need to have output labels for all of the examples. So if you have an unavailable Y value, that example needs to be removed from the training set.

Perhaps the method @Manu is proposing is not a good match for this task.

Hey @TMosh,
Thanks a lot for your input. I really missed out on the fact that we don’t have y values for examples having N/A. I will update my answer.

Regards,
Elemento

Hi @Elemento,
Thank you, excellent point regarding screening out the N/A values and using a tabular dataset. As a matter of fact, I did that already and used a Gradient Boosting Machine approach. It worked fine in terms of accuracy, but I found 2 problems with a tabular approach that screens out N/A:

  1. I am missing the environmental structure: the values of the environmental variables are just the values of the raster cell in which the presences or absences are found, not the whole environmental structure close by, such as a river network, or the interaction effects between different types of environmental variables (river network + rocky plain). In other words, a CNN would enable this by learning an environmental representation, as on an image. This comes back to your point regarding the image: handling the stack of X variables as an image would enable this, and a CNN would be a clear added value. For info, some research has already been done on this: Convolutional neural networks improve species distribution modelling by capturing the spatial structure of the environment

  2. It takes 90 minutes to process one single species, and the objective would be to process 300 of them, regularly. I could of course do some optimization, but more importantly, single-species prediction does not enable learning an environmental representation common to a large number of species, which stabilizes predictions from one species to another.

Therefore, a CNN seemed to be the next best move in this regard.

Hey @TMosh,
Thanks for your feedback. As stated previously in my answer to @Elemento, some research has already been done on this:

But as you say, the structure of Y needs to be carefully considered. In that research, they used the presences of species only. The objective would be to improve on this and use absence points as well.

Hey @Manu,
I think from this answer, we can safely conclude that CNNs have a great advantage in your application, too great to discard them for a simple ML classification model.

So, now we have some grid cells for which we have missing X and y values. How about this? You may be able to find some techniques to handle invalid pixels in images, with which you can generate values for your raster data, i.e., X, provided that the assumptions of these techniques hold true. I have mentioned some of these below for your reference:

Once you employ one of these, or perhaps some other technique, we will have a complete representation of X without any N/A values. Now, you can use an ML-based classification model, for instance XGBoost, to predict the values of y for the grid cells for which we have just generated the values of X. You can easily train this model on the values of X and y which were already available. Once this step is done, you will have a complete X and y representation, and then you can employ CNNs.

Does it help you in any way? It may be computationally expensive, and I am not sure whether this is a valid method, but what is your opinion on this?

Regards,
Elemento

Hey @Elemento,
Thanks for these interesting articles and points of view. It helps me a lot with my thinking.

The main problem in my case is to handle N/A in Y, not in X. For example:

  • X1 is, for example, water: each pixel or cell has value 1 if a river is found in it, or 0 if not.
  • X2 is vegetation: 1 if found in a pixel, 0 if not.

Therefore, there is no problem with N/A at the level of X; the main issue is N/A in Y.

Let’s continue with a simple example to help my thinking process as well: we could imagine pictures or rasters made of only 4 pixels (or cells).

X

  • X1.shape = (2, 2, 1) = water, with value 1 if present in a pixel or 0 if absent
  • X2.shape = (2, 2, 1) = vegetation, with value 1 if present in a pixel or 0 if absent
    → the X input shape is thus (2, 2, 2)

Y for a single species:
Y.shape = (2, 2, 1) = foxes, with values:

  • 1 if found present in a pixel visited by an observer
  • 0 if found absent in a pixel visited by an observer
  • N/A if the pixel was not visited by an observer
    → the Y output shape is (2, 2, 1)
    → the values found in the 4 pixels or cells could be, for example, [1, 0, N/A, 1]

Three potential solutions:
I think we have three potential scenarios to solve this problem.

1. Fill missing values (N/A) in Y
This would be the approach you mentioned above, using for example XGBoost on tabular data to fill the N/A values in Y.

The advantage is that it would enable leveraging a CNN in a second phase on multiple species stacked in Y (Y1, Y2, Y3…).
The drawback is processing time, as you mentioned. I am also thinking about potential bias: in a first phase, the tabular XGBoost processing would ignore the environmental structure and just use the single-cell values of X to predict Y, and would then apply the learned function to fill in all the N/A values for a species. In a second phase, the CNN would be applied, which would mean benefiting from the environmental structure but learning on some new values of Y (that were N/A before) that may not be entirely correct, due to the limitations of the tabular XGBoost processing.

2. Remove N/A weights
I do not know if it’s technically possible, but let’s imagine that we conduct a three-class classification training on Y:

  • Class0 : absence
  • Class1 : presence
  • Class2: unknown (N/A)

At test or production time, would there be a way to only use the weights of Class0 and Class1 for future predictions, and thus discard the weights of Class2? Image segmentation would enable such an output, but I do not know if we could discard or neutralize all the weights related to a specific class, in order to focus on and use only the weights related to the two other classes (Class0, Class1) to get a prediction.

3. Neutralize N/A cells in Y
I do not know if it’s technically possible either. Would it be possible to define an area or zone of a picture that the CNN should not learn from? If we could define such “non-learning spots” on a picture, we could then apply this technique in our case.

What do you think?

Hey @Manu,
I guess the first point requires no further discussion. You have laid out its pros and cons very nicely, and I agree with all of them.


Now coming to the second point. I haven’t seen this approach used in any research till now, because, as far as my understanding goes, we can’t draw a direct relationship between the weights and the classes. The weights are more related to the function that the neural network learns in order to predict one of the many classes. Even if, and that’s a big

IF

we learn to somehow modify the weights (and hence the function learnt by the neural network) so that the network predicts only 2 classes (instead of 3), it will predict incorrectly on the cells that would originally have been classified as ‘N/A’, because the dataset trained the model to classify a cell into 3 classes, not 2.

If you are wondering why it’s a big if, this is because neural networks in general are black-box models. So, understanding the function learnt by a neural network is not an easy task, and modifying it is an even harder one. There has been a ton of research in the past decade enhancing our understanding of neural networks, but whether that is enough for this task, I am uncertain.

Additionally, classifying a cell as ‘N/A’ doesn’t make much sense to me, because we simply have missing labels for these square cells. These cells don’t have any difference in the distribution of X from the cells that have labels; it’s just that the labels are missing. So how, do you think, could a model possibly differentiate these cells? For example, consider 2 examples having the same features: for one you have the label, for the other you don’t. It’s a completely valid case, and there is no way a model can differentiate between these 2 examples. So, even if we are somehow able to implement this approach, once again a big

IF

we will circle back to


Now coming to the third approach, this seems to be an interesting one. I assume you are thinking of some sort of masking approach. I thought about this too, and it seemed pretty good, until something else came to my mind. Let’s say we apply the masking in the input layer, i.e., to X before it is fed to any layer. Now, what’s stopping the neural network from making some sense out of these masking values and using them to learn a function which classifies each of the cells as presence or absence of a species? We wanted to make the neural network exploit the spatial information in the first place, and I guess it might make a lot more sense of it than we wanted it to. At inference or production time, there will be no mask in any of the examples. So, will the performance be retained?

Another possible place to apply masking is in the cost computation. But if we do this, it is as good as adding 0 to the cost for cells having ‘N/A’ as their labels, which is another way of saying that for these cells we have perfect predictions, which is definitely wrong. So, how do you think we can employ this approach?


In conclusion, the second approach seems to be a dead end to me. The first approach you have described pretty well, as I just said, and the third approach could possibly be used, but I am pretty uncertain of this as well :joy:

Let me tag some other mentors, and they will surely be able to correct our perspectives if they are going wrong somewhere, or perhaps provide some new perspectives.

@TMosh @paulinpaloalto @rmwkwok @anon57530071 Guys, can you please look into this query and provide your opinions. Thanks in advance.

Cheers!


Hey @Elemento, thanks for tagging me.

Hello @Manu, how are you? I have 2 ideas after reading your discussions.

#1
Your GBM worked fine, so why not just add some features to account for the environmental structure? A 3×3 filter takes only adjacent cells into account, so the GBM equivalent would be, for each cell, to calculate, for example, the sum of the surrounding 8 cells, and see if it improves your baseline accuracy. If not, expand one cell outward to aggregate 8 + 16 cells, and see the change.
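A possible NumPy sketch of this neighbourhood feature (the `neighbor_sum` helper is hypothetical; `radius=1` sums the 8 adjacent cells, `radius=2` the surrounding 8 + 16 cells):

```python
import numpy as np

def neighbor_sum(raster, radius=1):
    """Sum of all cells within `radius` of each cell, excluding the cell itself."""
    padded = np.pad(raster, radius, mode="constant")
    h, w = raster.shape
    total = np.zeros((h, w), dtype=float)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue  # skip the cell itself
            total += padded[radius + dy : radius + dy + h,
                            radius + dx : radius + dx + w]
    return total

vegetation = np.ones((3, 3))
print(neighbor_sum(vegetation))  # centre cell -> 8.0, corners -> 3.0
```

The resulting array, flattened, could then be appended as an extra column to the tabular GBM features.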

#2
For your CNN, masking the unknown y values in the calculation of the loss is a great idea, and you normalize your total loss by the number of known y values. You will probably need to add the mask into the loss function yourself.

Cheers!

#3

Another way would be: for each known y, you cut out a smaller surrounding raster area as your new X; then your new X-y dataset always has a known y.

The best cutting size should be determined by analysis of your data and/or domain knowledge and/or experiments.


Hey @rmwkwok,
Won’t the second approach, lead to the below issue?

Additionally, I am a little confused about the 3rd approach that you have mentioned. Does it involve replicating the values of y to their surrounding square cells having N/A as labels, based on a KNN-sort-of algorithm, for each of the smaller regions that we cut out of the original region?

Regards,
Elemento

No. First, in the forward pass, the network will predict something for the unknown cells, but because you masked them, their influence will not be propagated back to the weights of your model.

Cost = J(\vec{\hat{y}}, \vec{y}) = J(\vec{\hat{y}}_{\text{known}}, \vec{y}_{\text{known}}) + J(\vec{\hat{y}}_{\text{unknown}}, \vec{y}_{\text{unknown}})

Cost_{\text{masked}} = J(\vec{\hat{y}}_{\text{known}}, \vec{y}_{\text{known}})

Again, you need to introduce the mask to the loss function implementation to make that happen.
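A minimal NumPy sketch of such a masked loss (binary cross-entropy assumed; in a real Keras or PyTorch model the same idea would be written with that framework's tensor ops, and the toy values below are made up):

```python
import numpy as np

def masked_bce(y_true, y_pred, mask, eps=1e-7):
    """Binary cross-entropy over known cells only.

    mask: 1.0 where y is known, 0.0 where y is N/A.
    """
    p = np.clip(y_pred, eps, 1 - eps)
    per_cell = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    per_cell = per_cell * mask           # unknown cells contribute nothing
    return per_cell.sum() / mask.sum()   # normalize by the number of known y

y_true = np.array([[1.0, 0.0], [0.0, 1.0]])  # value at the masked cell is arbitrary
mask = np.array([[1.0, 1.0], [0.0, 1.0]])    # bottom-left cell is N/A
y_pred = np.array([[0.9, 0.1], [0.5, 0.8]])
print(round(masked_bce(y_true, y_pred, mask), 4))  # -> 0.1446
```

Dividing by `mask.sum()` rather than the total cell count is what keeps the loss comparable between rasters with different numbers of visited cells.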

No, not at all. Let’s say we have a 100 × 100 raster, and there are two cells with known y = 1, at locations (25, 30) and (72, 80). Assuming the cutting size is 21 × 21, I am going to cut two squares centered at the above 2 locations. The first one will run from 15 to 35 on the x-axis and from 20 to 40 on the y-axis. And this new, smaller sub-raster will carry the label y = 1.

So, a 21x21 raster as the new X, and 1 as the label for this X.
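A sketch of this cutting step (NumPy assumed; the `extract_patch` helper is hypothetical, and the centers and labels are the example values from above):

```python
import numpy as np

def extract_patch(X, center, size=21):
    """Cut a size x size window of predictors centered at a known cell."""
    half = size // 2
    r, c = center
    # Pad spatially so patches near the border still have the full size
    Xp = np.pad(X, ((half, half), (half, half), (0, 0)), mode="constant")
    return Xp[r : r + size, c : c + size, :]

X = np.random.default_rng(0).random((100, 100, 15))
known = {(25, 30): 1, (72, 80): 1}       # cell -> observed label

patches = np.stack([extract_patch(X, rc) for rc in known])
labels = np.array(list(known.values()))
print(patches.shape, labels.shape)       # (2, 21, 21, 15) (2,)
```

Each patch then plays the role of one "photo" in the Cat / Not Cat analogy, with a single label per patch.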

@Manu Is 90 minutes the time for training a model for one species, or the time for making predictions for one species across all rasters?

I am really confused now @rmwkwok :joy: I guess I have just said the same thing, haven’t I :thinking:

Since the labels are for the individual cells, isn’t using y = 1 for all the cells in the region defined by (15, 20), (35, 20), (15, 40) and (35, 40) the same as replicating y = 1 for all these cells? And I am assuming that if any of these cells have a known y, then we will simply use those labels. Please correct me if I am understanding it incorrectly.

Regards,
Elemento

No problem :slight_smile:

Now the new model will accept my new X and predict one value of y, not a matrix of y values. It’s like accepting a photo and predicting whether it is Cat or Not Cat.

But the problem requires us to predict the matrix of labels, right :thinking:? I.e., determining whether the species is present in each cell of the raster, instead of whether it is present in the entire raster?

In that case, we need to make many more predictions, sliding through the whole 100 × 100 raster.