Why use Softmax instead of a linear transform that sums to 1?

Hey, this question isn't tied to any specific course, but here is my doubt:

If we have a group of output neurons and we want to map them to probabilities that sum up to 1, why do we use softmax, which is e^x(i) / sigma(e^x), instead of something like:
probability vector = abs(x(i)) / sigma(abs(x)), where abs is the absolute value, to handle negative values?
The latter also ensures all the numbers in the vector sum to one, which gives a simple way to think about probability.
What's the benefit of using softmax?
My intuition tells me this has something to do with the loss function, but I would appreciate a derivation.
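
To make the two options concrete, here is a small NumPy sketch (just my own illustration, not from any course material) of the two mappings I have in mind:

```python
import numpy as np

def softmax(z):
    """Softmax: exponentiate each entry, then normalize so the result sums to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def abs_normalize(z):
    """The alternative from my question: abs(z_i) / sum_j abs(z_j)."""
    a = np.abs(z)
    return a / a.sum()

z = np.array([-2.0, 0.0, 2.0])
print(softmax(z))        # ~[0.016, 0.117, 0.867] -- larger logits get larger probability
print(abs_normalize(z))  # [0.5, 0.0, 0.5] -- -2 and +2 get equal mass, 0 gets none
```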

Hi @Jaskeerat. You want a good probability measure, not just one that satisfies the axioms. In classification tasks, we are trying to predict the probability that a particular example is a member of a class. The probability model here is based on the idea of independent, repeated trials that can take on n outcomes (n = 2 in the case of binary classification, n > 2 for multinomial classification).

The binomial distribution is associated with independent, repeated Bernoulli trials; the multinomial distribution (whose single-trial form is sometimes called the "multinoulli," or categorical, distribution) is associated with independent, repeated trials that generalize from Bernoulli trials with two outcomes to more than two outcomes (e.g. weighted coin tosses in the case of binary (binomial) classification; tosses of a weighted die in the case of multinomial classification).

Just as the logistic cost function (binary cross-entropy) can be derived by applying the likelihood principle to the Bernoulli distribution, the softmax cost function (categorical cross-entropy) can be similarly derived from the multinomial distribution. These functions also have very appealing properties from an information-theoretic perspective.
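
To sketch what that derivation looks like (condensing the standard argument): for a single example with label y in {0, 1} and predicted probability y-hat, the Bernoulli likelihood and its negative log give the familiar loss:

```latex
% Bernoulli likelihood of a single observation y given predicted probability \hat{y}
P(y \mid \hat{y}) = \hat{y}^{\,y}\,(1-\hat{y})^{\,1-y}

% Negative log-likelihood = binary cross-entropy (the logistic loss)
-\log P(y \mid \hat{y}) = -\bigl[\, y \log \hat{y} + (1-y)\log(1-\hat{y}) \,\bigr]

% The multi-class case is analogous: the multinomial likelihood with one-hot
% labels y_k gives the categorical cross-entropy  -\sum_k y_k \log \hat{y}_k.
```

Averaging the loss over the training examples gives the corresponding cost function.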

If you have not already, I strongly recommend that you study the optional video at the end of Week 2: Explanation of logistic regression cost function.

Bottom line: The softmax function is derived from a natural probability model for classification tasks. BTW, your suggestion amounts to normalizing by the "Manhattan distance," i.e. the L1 norm.


Hey, thanks! I will go through the derivation, but aside from that, is there any benefit to using softmax over the one I described, or over any other transformation that makes the output vector sum to one?
What's the practical benefit of exponentiating the outputs (raising e to the power of each one)?

@kenb Sidenote: I just completed this specialization and I wanted to thank you. I was welcomed to the community by your answer on my topic of decision boundaries. Instead of directly giving the answer away, you gave me more questions to think about. And while that was initially off-putting, thinking about those questions allowed me to slowly work my way to a much stronger intuition about the entire subject. I just thought it was the perfect way to answer a question, and in that moment I knew I would love to help people in the same way. Thanks for starting me off on my journey and I wish you the best on yours!


Congratulations on finishing the course, @Jaskeerat. And thank you for your kind words. :grinning: Onwards and upwards!

But first, now that you have finished the course and there is no risk of my needlessly confusing you, you might take a look at the following regression equation:

log( y / (1 − y) ) = wᵀx + b

This is the logistic regression (or "logit") model from the statistics literature. Here y is some variable that naturally takes values between 0 and 1. For example, a probability! In that case, the argument to the log function on the left-hand side is the inverse of the odds ratio. Example: if y is 0.2, the odds ratio equals 4. We say that the odds are 4-1 ("4 to 1"). The lower the probability, the higher (or "longer") the "odds".

Now solve the equation for y. (Feel free to substitute z for the right-hand side to simplify your calculation.) You quickly see how the natural base e arises and how the equation above is just another expression of the logistic regression equation introduced in Week 2. Now you might ask: why the natural log on the left-hand side? Think about the domain and range of the log function for that one.


y = sigmoid(wᵀx + b). The domain of the log function is the positive reals, i.e. (0, ∞), and its range is all of R.
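
Writing out the algebra for completeness, with z = wᵀx + b:

```latex
% Solve the logit model for y, with z = w^T x + b
\log\frac{y}{1-y} = z
\;\Longrightarrow\;
\frac{y}{1-y} = e^{z}
\;\Longrightarrow\;
y = \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}} = \sigma(z)
```

So the natural base e appears as soon as the log on the left-hand side is inverted.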

@kenb I understand the derivation of the binary cross-entropy loss from the Bernoulli distribution and the derivation of the categorical cross-entropy loss from the multinomial distribution.

But I thought/assumed sigmoid and softmax were chosen somewhat arbitrarily: in the case of sigmoid, just to convert the output value z to something between 0 and 1 so it can be read as a probability, and in the case of softmax, to convert the vector z[L] so it sums to 1, giving each entry a sense of probability. I didn't/still don't understand how these last-layer activation functions are related to, or derived from, the loss.

That is, I would think that in logistic regression I could use any other non-linear, non-sigmoid activation function that outputs a value between 0 and 1 in the last layer and still use binary cross-entropy as the loss function?
Is this correct? If not, why not?

Bottom Line:
The cross-entropy losses can be derived from those distributions. Is there a similar derivation of the sigmoid and softmax activation functions from the Bernoulli/multinomial distributions?

Hello again @Jaskeerat. Derivation may be too strong a word choice. The choice of the loss function is closely related to the choice of output units (activations in the output layer, e.g., sigmoid, softmax). In this sense, they are not chosen arbitrarily.

One must keep in mind what the output layer is trying to accomplish. The hidden features that feed into the output layer have no discernible meaning on their own, so the output layer provides the additional structure needed to complete the task at hand. Here it is a classification task, so probabilities that help decide which category an example belongs to are useful.

And, importantly, since one can normalize any list of positive values by dividing by their sum – so that all the normalized values lie in [0, 1] and sum to one – the activation functions of the output layer are certainly not chosen on that basis alone!

So we are looking for something special about sigmoid and softmax, right? OK, buckle up. Assuming one has a strong foundation in probability and statistics, it is understood that the loss function (the cost function is the average of the losses) is typically chosen to be the cross-entropy between the data and model probability distributions. And that is equivalent to the negative of the log-likelihood of the model distribution. In the case of binary and multi-class classification, the model distributions are naturally chosen to be the Bernoulli and multinomial distributions.
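
Here is a tiny NumPy illustration of that equivalence (made-up numbers, not course code): for one-hot labels, the categorical cross-entropy is exactly the negative log of the probability the model assigns to the true class, i.e. the negative log-likelihood.

```python
import numpy as np

# Predicted class probabilities for 3 examples (each row sums to 1) -- made-up numbers
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
y_true = np.array([0, 1, 2])      # true class indices
y_onehot = np.eye(3)[y_true]      # one-hot encoded labels

# Categorical cross-entropy, averaged over the examples
cross_entropy = -np.mean(np.sum(y_onehot * np.log(y_hat), axis=1))

# Negative log-likelihood of the true classes under the model
nll = -np.mean(np.log(y_hat[np.arange(3), y_true]))

print(cross_entropy, nll)  # both ~0.499 -- the same quantity
```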

The emphasis above turns out to be critical, because log-probabilities are common in the statistical modeling literature. This was exactly my point in having you construct the sigmoid function from the logit model: it shows how the sigmoid unit can be motivated by the assumption that log-probabilities are linear in z.

Confession: I played a small trick on you by taking the log of the inverse odds ratio. This automatically normalized the probabilities (i.e. they sum to one) in advance. You could try it again with log(y) on the left-hand side and then apply the normalization after the fact.
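
Sketching that alternative route (my notation): assume the log of each unnormalized class probability is linear in the inputs, then normalize after the fact, and softmax falls out:

```latex
% Assume the log of each unnormalized probability \tilde{p}_k is given by the linear score z_k
\log \tilde{p}_k = z_k \quad\Longrightarrow\quad \tilde{p}_k = e^{z_k}

% Normalize so the probabilities sum to one
p_k = \frac{\tilde{p}_k}{\sum_j \tilde{p}_j} = \frac{e^{z_k}}{\sum_j e^{z_j}} = \mathrm{softmax}(z)_k
```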

But what you did not know was that log-probabilities are special. You also didn't know that cross-entropy is closely related to maximum likelihood, and maximum likelihood is a miracle! Trust me. If you are still confused after thinking about this, good! I have spent most of my life that way and I can report that it's (mostly) harmless. In fact, it's the only thing that guarantees those "Ah-ha!" moments.

The internet is an ocean of information, so I encourage you to practice your Google-Fu as you move on. On that note, I hope that you are enjoying Course 2!

Onwards and upwards!