Softmax layer preds

Hi, community! Today I faced an issue understanding softmax. For example, when classifying handwritten digits from 0 to 9, how do we determine which node in the softmax layer corresponds to which digit? In video examples, we simply assume they are ordered from 0 to 9, but that doesn’t seem natural.

My guess: Maybe the position of the prediction is determined by the cost function.

In a multiclass classifier, you have to define your labels. If you are classifying animals, for example, you have to decide which animal is label 0, which is label 1, and so forth. The normal way to do that is to provide an array such as ["aardvark", "cheetah", "elephant", "mongoose"]. Then the pictures need to be correctly labeled with the index of the matching animal.
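As a rough sketch (the class names and variable names here are just made up to make the mapping concrete):

```python
# Hypothetical label definition: the index in this list is the numeric label.
class_names = ["aardvark", "cheetah", "elephant", "mongoose"]

# Each training image is then labeled with the index of its animal.
label_for = {name: idx for idx, name in enumerate(class_names)}
y_train = [label_for["cheetah"], label_for["aardvark"]]  # -> [1, 0]
print(y_train)
```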

If the inputs are pictures of handwritten digits from 0 to 9, then you could decide that digit 4 is label 9 and digit 5 is label 3, but why on Earth would you do that? The natural mapping is obvious and the labels should conform to that.

The cost function for softmax is multiclass cross entropy. The cost function doesn’t define the labels: it merely uses them. You as the system designer have to define the labels and make sure your input samples are correctly labeled.


Thanks for your reply! Yes, you are absolutely right about labeling, but the neural network ends with a softmax layer that, in our case (digits), contains 10 nodes. How do we know which node corresponds to which digit?

And what if we had 12 nodes? Yes, it would be useless, but how do we determine which node represents what? Maybe at the end of training, we classify a large number of digits and observe which nodes in the softmax layer have the highest probabilities?

If you have 10 outputs from the softmax layer, then they are indexed 0 to 9, right? So the model predicts based on the corresponding labels on the training samples: pictures of “0” are labeled with index 0, so if the model recognizes the image as a “0” it will output a higher probability on index 0 of the output for that sample. If the model recognizes a “3” in the image, then the output for index 3 will have the highest value. And so forth …

For every input, we get a probability distribution on the outputs from 0 to 9. We select the highest value as being the prediction.
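For example, a minimal NumPy sketch of that selection step (the probabilities are made up):

```python
import numpy as np

# Made-up softmax output for one digit image: 10 probabilities, indices 0..9, summing to 1.
probs = np.array([0.01, 0.02, 0.03, 0.80, 0.02, 0.03, 0.03, 0.02, 0.02, 0.02])

prediction = np.argmax(probs)  # index of the highest probability
print(prediction)              # 3, so the model predicts the digit "3"
```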

Suppose we have 12 output classes, e.g. 12 different types of animals. Then the outputs will work the same way: the inputs are labeled 0 to 11 and for each input the model gives the highest probability to the output that it believes represents the correct label (type of animal) for that input.


Or if what you mean is you define your model to have 12 output classes for the digit recognition case, then the point is that there should be no inputs labeled 10 or 11, right? What would that even mean? There are only 10 digits so all the images would be 0 to 9, so the model should always predict probability 0 for output 10 or 11.

Or you could enhance the model to say that there are 11 outputs and index 10 (the last one) means “none of the above”. But that value on an actual prediction from the model will only ever be non-zero if your input dataset has some samples that are labeled “none of the above” (label 10), so that the model can learn to recognize that case.

Edit: well, we’d have to try this experiment. If there are no training samples with a given label, you might get small random outputs for the missing label, because we start with random weights for symmetry breaking. There would be nothing to force the cost function to try to increase that output value, but maybe there’s nothing to force it to zero either. But there are samples that would force higher values on the other “real” labels, so the softmax output should never select label 10, unless there are actual training samples with that label.


I think I’ve got the idea! We use one-hot encoding for labels; that’s how we get a distribution like [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] to represent the digit 2, for example. Then we train the neural network so that the softmax layer outputs a distribution close to that.

We train the model to learn this distribution and penalize it if the output is incorrect. This is where the loss function comes in: for each sample, it tells us how to penalize the model so that the softmax probabilities move closer to [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]. If we use mini-batches, we combine all the individual losses into a single value (the cost function), so the model captures the overall trend of how the weights should be adjusted to correct the output for many samples at once, while also smoothing out fluctuations caused by individual samples.
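As a rough sketch of that combination step (made-up numbers, plain NumPy):

```python
import numpy as np

# Made-up softmax outputs for a mini-batch of 2 samples, 10 classes each (rows sum to 1).
probs = np.array([
    [0.05, 0.05, 0.60, 0.05, 0.05, 0.05, 0.05, 0.05, 0.03, 0.02],  # leans toward "2"
    [0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10],  # completely unsure
])
# One-hot labels: the first sample is a "2", the second is a "7".
y = np.zeros_like(probs)
y[0, 2] = 1.0
y[1, 7] = 1.0

# Per-sample loss: -sum(y * log(p)) keeps only the true-class probability.
losses = -np.sum(y * np.log(probs), axis=1)
cost = losses.mean()  # single cost value for the whole mini-batch
print(losses, cost)
```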

But I still don’t get the idea of learning on just the labels 0, 1, 2, …, 9.

You are right on this one. We have a labeled dataset of digits, with exactly 10 labels. I got a bit mixed up between supervised and unsupervised learning when I started talking about 12 nodes. Thanks for clarifying.

Hello, @Artem_Vashina, the idea is the same.

Let’s say we have a sample, and we get a prediction. Because there are 10 classes, the prediction has 10 probability values, for example:

[0.02, 0.03, 0.40, 0.05, 0.05, 0.10, 0.05, 0.10, 0.10, 0.10]

One-hot label

Now, with [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] as the label, essentially it picks only the 2nd (0-based) probability value to compute the cost. The “picking” can easily be done with an element-wise multiplication, because after that everything else is zero except for the 2nd value.

Digit label

With 2 as the label, the picking is even more intuitive, right? You just take the probability at index 2, and there are library functions to do exactly this kind of picking.


With the picked probability, we take its log and then compute the cost from it. In other words, we only involve the probability for the true class and ignore all the others.
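A tiny NumPy sketch of both picking styles (using the same made-up prediction as above):

```python
import numpy as np

probs = np.array([0.02, 0.03, 0.40, 0.05, 0.05, 0.10, 0.05, 0.10, 0.10, 0.10])

# Picking with a one-hot label: element-wise multiply, everything but index 2 becomes zero.
one_hot = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
picked = np.sum(one_hot * probs)   # 0.40

# Picking with the digit label directly: plain indexing does the same job.
label = 2
picked_again = probs[label]        # 0.40

loss = -np.log(picked)             # the per-sample cross-entropy loss
```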

In your last message you emphasized “closer to [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]”, which is a very good and precise intuition, so I suppose you might wonder: given that we ignore the others, would the output still get closer to that distribution?

The answer is yes! Getting only the true class involved in the computation of the cost does not mean that only the predicted probability for the true class will get close to 1. At the same time, the predicted probabilities for all of the false classes will also get close to 0, because each and every probability value, including the picked one, is computed from the predictions for all classes. Remember our softmax formula:

a_j = e^{z_j} / (e^{z_0} + e^{z_1} + … + e^{z_9}), where the denominator sums over all 10 classes.
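A quick numeric check of that coupling (a toy 3-class example with made-up logits):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])
print(softmax(z))   # roughly [0.23, 0.63, 0.14]

# Raise only the logit of class 1: every other probability drops,
# because they all share the same denominator.
z[1] += 1.0
print(softmax(z))   # class 1 rises, classes 0 and 2 fall
```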

I will stop here for now. We can dig deeper if you want. Let us know.

Cheers!
Raymond


When you run the training, you have two choices: you can convert them to “one hot” labels and use the CategoricalCrossEntropy loss function. Or you can leave the labels in “categorical” form and use the SparseCategoricalCrossEntropy loss function. I have not looked at the source code for the latter, but my guess is that it internally converts the categorical labels to “one hot” for you.

When you use the model in “inference” mode to do predictions (so the loss function is not in the sequence), you can just use “argmax” on the softmax outputs to get the answer in “categorical” form.
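A quick Keras-style sketch of both options (assuming TensorFlow; the softmax outputs and labels are made up):

```python
import tensorflow as tf

# Made-up softmax outputs for 3 samples (3 classes) and their integer labels.
probs = tf.constant([[0.7, 0.2, 0.1],
                     [0.1, 0.8, 0.1],
                     [0.3, 0.3, 0.4]])
labels = tf.constant([0, 1, 2])

# Option 1: convert the labels to one-hot and use CategoricalCrossentropy.
cce = tf.keras.losses.CategoricalCrossentropy()
loss1 = cce(tf.one_hot(labels, depth=3), probs)

# Option 2: keep the integer labels and use SparseCategoricalCrossentropy.
scce = tf.keras.losses.SparseCategoricalCrossentropy()
loss2 = scce(labels, probs)

print(float(loss1), float(loss2))   # the two values should match

# In inference mode, argmax turns each softmax output back into a class index.
predictions = tf.argmax(probs, axis=1)   # -> [0, 1, 2]
```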


Actually, yesterday I traced the latter and stopped when I saw they used the gather method to do the picking. gather lets you give it a tensor and some indices.

This is the doc of the high-level version of gather.
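For anyone curious, here is roughly what that picking looks like with tf.gather (a sketch of the idea, not the actual library internals):

```python
import tensorflow as tf

# Made-up softmax outputs for 2 samples (10 classes) and their integer labels.
probs = tf.constant([[0.02, 0.03, 0.40, 0.05, 0.05, 0.10, 0.05, 0.10, 0.10, 0.10],
                     [0.60, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03]])
labels = tf.constant([2, 0])

# For each row, gather the probability at that row's label index.
picked = tf.gather(probs, labels, axis=1, batch_dims=1)   # -> [0.40, 0.60]
loss = -tf.math.log(picked)                               # per-sample cross-entropy
```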