W3 - Softmax cost function: punish false convictions?

Hello,
In the “Training a Softmax Classifier” video, Andrew Ng mentioned the natural generalization of logistic regression, obtained by choosing the cost function associated with a softmax guess like so:
L(y_hat, y) = -sum_j y_j * log(y_hat_j), which for a one-hot label reduces to -log(y_hat_c), where c is the correct class.

I get that the important thing, at first glance, is that you “punish” your model for not understanding that it was a cat (the cost being equal to -log(0.2) here), regardless of what it predicted for the other classes.

But is that really the only important thing? What I mean by that is: in this picture, the guess is [0.3 0.2 0.1 0.4] instead of [0 1 0 0], but it could have been much “worse” if, for example, the network was convinced it was looking at a dog, outputting, say, [0 0.2 0.8 0].

I feel like we could benefit from telling our network that “not knowing” is better than “being convinced of a mistake”.
To this end, maybe we could modify the cost function with, I don’t know, some L2 (or higher) norm of the other guesses (for example here, adding a supplementary cost of 0.3^2 + 0.1^2 + 0.4^2)?
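To make that a bit more concrete, here is a tiny NumPy sketch of what I mean (the numbers are the ones from the picture, and the extra term is only a naive first guess):

```python
import numpy as np

# True label is "cat" (class 1); the softmax output below is the guess from the picture.
y_true = np.array([0., 1., 0., 0.])
y_hat  = np.array([0.3, 0.2, 0.1, 0.4])

# Usual categorical cross-entropy: only the correct class matters.
usual_cost = -np.log(y_hat[1])        # -log(0.2) ~ 1.61

# The naive extra term I have in mind: squared L2 norm of the wrong guesses.
wrong = y_hat[y_true == 0.]
extra_cost = np.sum(wrong ** 2)       # 0.3^2 + 0.1^2 + 0.4^2 = 0.26

print(usual_cost, extra_cost)
```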

I’d be glad to hear your thoughts on this.

Thanks for reading :slight_smile:

Hey @Goudout,
That’s an interesting question. Let me share my opinion, considering the case of multi-class classification (4 classes). First of all, I would like to draw your attention to the point below:

Personally, I don’t think the network has the option of “not knowing”, since irrespective of what the network predicts, it can either be correct or not, and the current loss function, i.e. categorical cross-entropy, does a great job of computing the loss based on how much certainty the network assigns to the correct class.

Now, just as a hypothesis, let’s say that the network predicting [0.25, 0.25, 0.25, 0.25] is what we call “not knowing”. But even when the network predicts this, it’s no better to us than [0.3, 0.2, 0.1, 0.4], because even in the former scenario there is a 75% chance that the network will predict the wrong class. In other words, the network making incorrect predictions with certainty is pretty much the same as the network resorting to random decisions (or not knowing).

Consider a simple analogy for reference. Let’s say you are teaching a person how to drive a car, and not hitting anything is the primary objective. Now, “the person hitting something” is pretty much the same as “the person not knowing what to do”, because as soon as the person starts driving, they will resort to making random decisions, and trust me, neither of us would like to sit with such a person :joy:, since there is a very high probability that they will hit something.

In short, selecting the correct class with absolute certainty is our aim. Being certain about an incorrect class and being uncertain about all classes don’t differ much with respect to that aim.

Do share your thoughts on this, and then we will discuss further.

Regards,
Elemento

Dear @Elemento,

Thanks for your answer. I completely understand what you mean, but I don’t totally agree. You’re right that, in the end, the most important thing is that the NN selects the right class with certainty. But as with humans, I feel it is easier to teach something to someone who has no opinion than to someone who thinks the exact opposite.
In that regard, I feel like penalizing totally wrong answers (i.e. preferring [0.25 0.25 0.25 0.25] to [0.25 0 0 0.75] when trying to predict [1 0 0 0]) might be a good idea to speed up the process, even though I’m totally aware that after a certain number of iterations, the result might be the exact same.

And even after lots of iterations, it might bring some nuance. Imagine trying to predict a cat [1 0 0 0], and having the NN output [0.5 0.5 0 0]. This might indicate that the cat resembles a dog for our NN, whereas if the NN output [0.5 0.16 0.16 0.17], maybe the cat is just blurry and there’s nothing to do about it. In the latter case there is only a small extra penalty, in contrast with the former case, where we penalize the NN for believing it might be a dog. Maybe this would lead the NN to find new/better ways of discriminating between cats and dogs?

I don’t know if I’m making sense :slight_smile:. I’m pretty much totally new to machine learning, so I’m afraid I might sound a bit silly, but thanks for kindly answering with an open mind.

Best regards.

Élie

This is an interesting point and a good discussion! I think we can state things more simply like this:

The way softmax works, we know that the outputs add up to 1. That’s the whole point. So if we know the value that it assigns to the correct class of the label, then we know the sum of the rest of the (wrong) answers. The way Prof Ng writes the cost function, we only use the value for the class of the label. So we are saying that we don’t care how the wrong answers are distributed: we only need to care about the value assigned to the correct label. Apparently this works well enough in practice that it is the way everyone does this at this point in time. In other words, that is the current “state of the art”.
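To make that concrete, here is a tiny NumPy sketch (the logits and the index of the true class are arbitrary, just for illustration):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the outputs always sum to 1.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

z = np.array([1.2, -0.3, 0.5, 2.0])   # arbitrary logits
y_hat = softmax(z)
print(np.sum(y_hat))                   # 1.0 (up to floating point)

# Cross-entropy with a one-hot label uses only the entry for the true class,
# so the total mass on the wrong classes is already determined: it is 1 - y_hat[c].
c = 0                                  # index of the true class
loss = -np.log(y_hat[c])
```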

So if you think that your idea would be an improvement, then you need to construct some experiments to demonstrate that. So how would you construct your additional term to add to the cost function that would penalize more severely the case in which the wrong answers are more “concentrated” as opposed to just smeared around? Then try some experiments and see what happens. You’re introducing another level of tradeoff where you make the loss and gradient calculations more complex, so you need to be able to show that there is a corresponding benefit in terms of faster convergence that would outweigh the added complexity and computational cost. If you can show that, then you need to write it up in a paper and maybe next year everyone will be using Goudout Cost for multiclass classification! Your name in lights! :beers: :nerd_face:

Thanks for your interest @paulinpaloalto,

I’ll eventually get around to playing with this, probably (I haven’t thought about it too much yet). Roughly, the cost I have in mind is something like

L_p(y_hat, y) = -log(y_hat_c) - log(1 - ||y_hat_wrong||_p)

where y_hat_c is the prediction for the correct class c and ||y_hat_wrong||_p is the p-norm of the other (wrong) predictions.

Of course with p=1 we stick with the usual cost (up to a constant factor, so it’s equivalent), and with larger p’s (potentially +inf) we penalize more. I’m also not sure yet, but I don’t think it’s computationally heavier in a meaningful way.
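Just to fix ideas, here is a rough NumPy sketch of it (the helper name and exact form are made up on the spot, of course):

```python
import numpy as np

def cost_p(y_hat, c, p):
    """Usual cross-entropy plus a penalty on how concentrated the wrong mass is.

    y_hat: softmax output (1-D array); c: index of the true class;
    p: norm order (p=1 gives twice the usual cost, i.e. equivalent;
       p=np.inf penalizes a single dominant wrong class the most).
    """
    usual = -np.log(y_hat[c])
    wrong = np.delete(y_hat, c)
    extra = -np.log(1.0 - np.linalg.norm(wrong, ord=p))
    return usual + extra

y = np.array([0.3, 0.2, 0.1, 0.4])   # the guess from the picture, cat = class 1
print(cost_p(y, 1, p=1))             # 2 * (-log(0.2)), equivalent to the usual cost
print(cost_p(y, 1, p=np.inf))        # penalizes the dominant 0.4 wrong guess more
```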

If I ever do get around to this, I’ll be back with the results. If anyone is already set up to try this out quickly, and has the motivation to do so, then much love to that person :slight_smile:

See you around.
Élie

Hey @Goudout,
We will be looking forward to the results. I would like to add some concluding remarks.

I guess the extra computation lies in calculating the gradients. “It’s just an extra term”, you might think, but if we imagine a case with a considerable batch size and a considerable number of layers, this “extra term” will propagate down through many layers and will require extra vector computations. No doubt we have packages that do these efficiently, but viewed from a different perspective, it’s simply “extra computation” even at the efficiency scale of these packages.

Another point that I found extremely interesting in @paulinpaloalto’s explanation is the following:

In other words, when we are computing the loss, we are indirectly taking into account the total certainty with which the network is making wrong predictions. Now, it may not be the same as penalizing individual incorrect predictions, but it goes along the lines of penalizing the cumulative of the incorrect predictions. For instance, consider 2 predictions made by the network: [0.1, 0.3, 0.4, 0.2] and [0.1, 0.1, 0.5, 0.3]. The current loss function only takes 0.3 and 0.1 into account, but we know that 0.3 = 1 - 0.7 and 0.1 = 1 - 0.9, where 0.7 and 0.9 are the cumulative totals of the incorrect predictions. I guess you get the point I am trying to make. I hope this helps.
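Just to spell that out numerically (with the true class being the second entry, i.e. the 0.3 and the 0.1):

```python
import numpy as np

# Two hypothetical predictions; the true class is index 1 in both.
p1 = np.array([0.1, 0.3, 0.4, 0.2])
p2 = np.array([0.1, 0.1, 0.5, 0.3])

for p in (p1, p2):
    correct = p[1]
    wrong_total = np.sum(np.delete(p, 1))
    # The loss looks only at `correct`, but since correct = 1 - wrong_total,
    # the cumulative mass on the wrong classes is penalized implicitly.
    print(-np.log(correct), correct, wrong_total)
```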

Regards,
Elemento

Hi,

Thanks for the point regarding the computations. I might have underestimated the added complexity a bit indeed.

Regarding your second point, I get what you mean. But the idea behind my “new” cost function is not to penalize the cumulative error per se (that would correspond to taking p=1 in my previous message, and thus to using the usual cost function), but rather to penalize whether the error is concentrated along a single axis or not.
The extreme case of this is p=+inf in my previous message. Imagine you have N dimensions and the true label is [1 0 ... 0]. Now imagine the 2 bad guesses bg1 = [1/N 1/N ... 1/N] and bg2 = [1/N 0 ... 0 (N-1)/N]. With the usual cost function, bg1 and bg2 cost exactly the same: log(N). But with my “new” cost and p=+inf, the added cost of bg1 would be (for N large) ~1/N, while the added cost of bg2 would be log(N). If you compare the total new cost to the usual cost, it’s essentially “not changing” in the first case, and “doubling” in the second case.

In the first case, your NN is still trying to learn, so it’s “ok” if it doesn’t know yet, and the usual cost seems appropriate. In the second case, the cost doubles, which seems like a good idea to me, because you need to teach your NN that it’s producing complete garbage (being almost certain of an error) and needs to take this error into account more seriously.
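Here is a quick numerical check of the bg1/bg2 example above (plain NumPy, with an arbitrary N):

```python
import numpy as np

N = 1000
bg1 = np.full(N, 1.0 / N)            # "doesn't know": uniform over all N classes
bg2 = np.zeros(N)                    # "convinced of a mistake": mass on one wrong class
bg2[0] = 1.0 / N
bg2[-1] = (N - 1) / N

def added_cost_inf(y_hat, c):
    # The p = +inf penalty term: -log(1 - max of the wrong predictions).
    wrong = np.delete(y_hat, c)
    return -np.log(1.0 - np.max(wrong))

print(np.log(N))                     # usual cost, the same for both: ~6.9
print(added_cost_inf(bg1, 0))        # ~1/N: the total barely changes
print(added_cost_inf(bg2, 0))        # ~log(N): the total roughly doubles
```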

Best regards.
Élie

Hey @Goudout,
I do understand what you are trying to do, and that’s why I mentioned the point above about the cumulative of the incorrect predictions.

It is indeed different from what you are trying to do, but if someone asked me to draw a similarity between the 2 approaches, this is what I would point out, i.e. both approaches penalize the incorrect predictions (though in different ways).

Cheers,
Elemento

Just to add a little more about the compute cost: you’ve at least doubled the compute cost of the loss calculation and all the gradients that go with that for the back prop. Now that incremental cost may be a small percentage of the overall compute cost of a single epoch of training, but training cost is a big deal in the “real world”. Maybe it’s an extreme case, but did you read any of the articles about the new GPT-3 model that was recently released? 175 billion parameters and 45 terabytes of training data. So if you increase the cost in any incremental way, you need to justify that by showing that your method ends up requiring fewer epochs of training to achieve the same or better results. The high level point is that training cost is not a trivial matter: any additional cost needs to be justified.
