Derivation of softmax function from sigmoid function of logistic regression

Hello Mentors,

I have a question regarding the derivation of the softmax function from the sigmoid.

The softmax function is a generalized form of the sigmoid function, and I am curious about how it is derived.

I have also checked Raymond’s explanation in the topic below -

I was wondering: how is sigmoid(z1 - z2) equivalent to sigmoid(z1)?

It would be great if someone could explain the derivation of the softmax function from the sigmoid function.

Thanks
Tamal

My response here:

Let’s see if it makes sense to everyone :wink:

Cheers,
Raymond

I got lost in the algebra when I tried to write out a proof.

The other thread doesn’t work out the derivation of your formula either.

It would be really informative to see it worked out.


Hi @tamalmallick,

For 2 classes the softmax is identical to the sigmoid:

\begin{align} {\bf z} &= [z, 0] \\ {\rm Softmax}({\bf z})_1 &= {e^z \over e^z + e^0} = {e^z \over e^z + 1} = {1 \over 1 + e^{-z}} = \sigma(z) \\ {\rm Softmax}({\bf z})_2 &= {e^0 \over e^z + e^0} = {1 \over e^z + 1} = 1 - \sigma(z) \end{align}
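
If it helps to see this numerically, here is a minimal NumPy sketch (the `softmax` and `sigmoid` helpers below are my own throwaway functions, not course code):

```python
import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 1.7                                # any logit value
print(softmax(np.array([z, 0.0])))     # -> approx. [0.8455, 0.1545]
print(sigmoid(z), 1.0 - sigmoid(z))    # -> the same two numbers
```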

Hi @consell,

Thanks for this, but I have a follow-up question. You mentioned z = [z, 0]. How does the second entry become 0 in the 2-class case, when each logit is a linear expression z = W.X + b?

That seems to be a specific example where it is assumed that the output of the first unit is ‘z’, and the output of the second unit is zero.

I’d like to see a derivation where z1 and z2 are both any real numbers.


@tamalmallick,

The softmax function is shift invariant for any constant c:

\begin{align} {\rm Softmax}([z_1, z_2]) &= \left[ {e^{z_1} \over e^{z_1} + e^{z_2}}, {e^{z_2} \over e^{z_1} + e^{z_2}} \right] \\ &= \left[ {e^{z_1} e^{-c} \over e^{-c} (e^{z_1} + e^{z_2})}, {e^{z_2} e^{-c} \over e^{-c} (e^{z_1} + e^{z_2})} \right] \\ &= \left[ {e^{z_1 -c} \over e^{z_1 - c} + e^{z_2 - c}}, {e^{z_2 - c} \over e^{z_1 - c} + e^{z_2 - c}} \right] \\ &= {\rm Softmax}([z_1 - c, z_2 -c]). \end{align}

(This fact is used to ensure numerical stability, by subtracting the maximum logit from all logits.)
So, for binary classification, we can shift the logits by setting c = z_2:

{\rm Softmax}([z_1, z_2]) = {\rm Softmax}([z_1 - z_2, 0]).
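
As a quick numerical illustration of this shift invariance (a sketch with made-up logit values; the second print also uses the max-subtraction stability trick mentioned above):

```python
import numpy as np

def softmax(z):
    e = np.exp(z)
    return e / e.sum()

z = np.array([3.2, 1.1])
print(softmax(z))             # approx. [0.8909, 0.1091]
print(softmax(z - z.max()))   # same values: shifting every logit by a constant changes nothing
print(softmax(z - z[1]))      # same values again, and now the second logit is 0
```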

Let z = z_1 - z_2. Then:

\begin{align} {\rm Softmax}([z_1, z_2]) &= {\rm Softmax}([z, 0]) \\ &= [\sigma(z), 1 - \sigma(z)] \\ &= [\sigma(z_1 - z_2), 1 - \sigma(z_1 - z_2)], \end{align}

which matches the result presented by Raymond @rmwkwok and shows that the binary softmax is equivalent to a sigmoid of the difference of the logits.
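
Here is a small sketch that checks this identity for arbitrary real z_1 and z_2, which was the case asked about above (again, the helper functions are just my own, not from any assignment):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # safe because softmax is shift invariant
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
for _ in range(5):
    z1, z2 = rng.normal(size=2)        # two arbitrary real logits
    lhs = softmax(np.array([z1, z2]))
    rhs = np.array([sigmoid(z1 - z2), 1.0 - sigmoid(z1 - z2)])
    assert np.allclose(lhs, rhs)
print("binary softmax matches [sigmoid(z1 - z2), 1 - sigmoid(z1 - z2)] on random logits")
```
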
I am not sure what exactly you are trying to derive, because the softmax function converts a vector of logits z_i into a probability distribution over classes, whereas the sigmoid function maps a single logit to a probability between 0 and 1. In particular, softmax is used when you have multiple mutually exclusive classes, and it ensures that the outputs sum to 1 across all classes. Sigmoid, on the other hand, is used when you are modeling the probability of a single class or of multiple independent classes, and it does not enforce that the outputs sum to 1.
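
To make that last point concrete, here is a small sketch contrasting the two on the same made-up logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, -1.0, 0.5])    # three made-up logits

p_softmax = softmax(logits)
p_sigmoid = sigmoid(logits)

print(p_softmax, p_softmax.sum())      # sums to 1: one distribution over mutually exclusive classes
print(p_sigmoid, p_sigmoid.sum())      # each entry is in (0, 1), but the sum is not constrained to 1
```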

I am seeing “Math processing errors” where the equations should be.


Please try to reload the page, or try another browser. This error sometimes occurs on my iPhone when the connection is slow.

(Screenshot attached)

I’m on Firefox on Windows 10. The “math processing error” seems to have resolved itself.