Replacing "sigma(x)" by "softmax(a,b)" in an ANN doing 2-class classification: a diagram

In the course we hear that, for 2-class classification, we can replace the final activation function of the ANN, \sigma(x), with softmax(a,b) without changing anything, but it is not immediately clear how this should be done (note that the slide says “logistic regression” but that should really say “deep logistic classification” as we may have more than 1 layer):

To clarify this, here is a diagram (or rather, several diagrams).

Standard case

First, the diagram for the “standard” case. It has become somewhat large.

  • 2 layer neural network
  • Final activation function is \sigma(x)
  • The network tries to estimate the probability of the true label Y^{(j)} being 1 for input X^{(j)} (j being the example index, as usual).
    • This is not the same as: the network tries to guess the true label Y^{(j)} for input X^{(j)} (maybe by giving a fuzzy membership value). Realizing that actually took me some time.
  • LOSS and COST are computed based on negative log-likelihood (see the sketch right after this list).
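
A minimal NumPy sketch of that computation (my own illustration, assuming Z^{[2]} has shape (1,m) and Y holds the 0/1 labels; not the course's actual code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed shapes: Z2 is the last layer's pre-activation, (1, m); Y holds the 0/1 labels, (1, m).
def loss_and_cost(Z2, Y):
    AL = sigmoid(Z2)                                       # estimated P(Y = 1 | X) per example
    LOSS = -(Y * np.log(AL) + (1 - Y) * np.log(1 - AL))    # negative log-likelihood, shape (1, m)
    COST = np.mean(LOSS)                                   # average over the m examples
    return LOSS, COST
```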

Intermediate, one-hot

As an intermediate variation, we can switch to a “one-hot” representation for Y, i.e. the Y matrix now has two rows, with exactly one “hot” entry per column. This goes together with a two-row representation of AL, the last output, which carries no additional information but is defined as \begin{bmatrix} 1-AL^{(j)} \\ AL^{(j)} \end{bmatrix} for each example j. Nothing changes regarding computation.
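
In code, the switch could look like this (a sketch only; the row order, class 0 on top of class 1, follows the \begin{bmatrix} 1-AL^{(j)} \\ AL^{(j)} \end{bmatrix} convention above):

```python
import numpy as np

# Y and AL have shape (1, m): Y holds the 0/1 labels, AL the sigmoid outputs.
def to_two_rows(Y, AL):
    Y_onehot  = np.vstack([1 - Y, Y])      # row 0: indicator of class 0, row 1: indicator of class 1
    AL_tworow = np.vstack([1 - AL, AL])    # row 0: P(class 0),           row 1: P(class 1)
    return Y_onehot, AL_tworow
```

Summing Y_onehot * log(AL_tworow) over the two rows gives back exactly Y·log(AL) + (1-Y)·log(1-AL), which is why nothing changes regarding computation.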

Only the parts relevant to the change are given below:

With softmax

Here we replace \sigma(x) with softmax(a,b), where a and b are derived from Z^{[2]} in a straightforward fashion. It all works out and nothing changes regarding computation.
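
A quick numerical check of this (a sketch; I take a = -Z^{[2]} and b = 0, the expansion used in the diagram and generalized below):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(v):
    e = np.exp(v - np.max(v, axis=0, keepdims=True))  # subtract the column max for numerical stability
    return e / np.sum(e, axis=0, keepdims=True)

Z2 = np.array([[-2.0, -0.5, 0.0, 1.3, 4.0]])    # a few arbitrary pre-activations, shape (1, m)
expanded = np.vstack([-Z2, np.zeros_like(Z2)])  # [a; b] = [-Z2; 0]

# Row 1 of softmax([-Z2; 0]) equals sigmoid(Z2), row 0 equals 1 - sigmoid(Z2).
assert np.allclose(softmax(expanded)[1], sigmoid(Z2))
assert np.allclose(softmax(expanded)[0], 1 - sigmoid(Z2))
```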

(Generalization: instead of just considering the case where Z^{[2](j)} is expanded to \begin{bmatrix} -Z^{[2](j)} \\ 0 \end{bmatrix} for all j, we can define a new free parameter \mu (either a constant for the ANN or a function of Z^{[2](j)}) and expand Z^{[2](j)} to \begin{bmatrix} \mu - Z^{[2](j)} \\ \mu \end{bmatrix}.)
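
Because softmax is invariant to adding the same constant to all of its inputs, any choice of \mu gives back exactly the same probabilities. Reusing the arrays from the sketch above:

```python
mu = 7.3                                                 # arbitrary; it cancels inside the softmax
shifted = np.vstack([mu - Z2, mu * np.ones_like(Z2)])    # the generalized expansion [mu - Z2; mu]
assert np.allclose(softmax(shifted), softmax(expanded))  # identical to the [-Z2; 0] case
```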


It should be the probability of the true label being “1” instead?

Here is how I would understand it:

Cheers,
Raymond


It should be the probability of the true label being “1” instead?

It could be. After all, that’s arbitrary (N.B. we have classes 0 and 1, so I would tend to estimate the probability for the class with the “lower” label). How does the course do it?

Okay, the course does it the reverse of the way I do.

Doing it the way the course does changes the LOSS computation, making it match the interpretation “the ANN makes a guess at the true label by giving a fuzzy membership value”.

Gonna fix that in the diagram.

Here is how I would understand it:

Makes sense. So we have this:

sigma-terminated ANN transformed into softmax-terminated ANN

We have a free parameter a that can be chosen to taste

softmax-terminated ANN transformed into sigma-terminated ANN

The “2-class softmax ANN” would probably not be built in practice, having two outputs where one will do.


Yes. What’s more, if we read the loss function -y \log(p) - (1-y) \log(1-p), we see that when the label is y=1, the loss is zero only when p=1. In other words, p has to be the probability of being class 1.
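
A tiny numerical illustration of that (my own numbers, not from the course):

```python
import numpy as np

def loss(y, p):
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

# With label y = 1 the loss only vanishes as p -> 1, so p must be P(class 1).
print(loss(1, 0.999))   # ~0.001, near zero
print(loss(1, 0.001))   # ~6.9, large
```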

The only arbitrary thing is how you interpret what the label 1 means verbally, or which examples you assign the label 1.


Btw, David, I raised the following as a question because I remembered that in one of your earlier diagrams you had said it was the probability of being 1, so I thought it was a typo here. It is good that we have revisited it here; it is an important step for learning.

Hmm hmmm … it’s subtle. It’s actually a problem of cascading conventions.

I have corrected the other diagrams, but I just noticed that, if I compute LOSS using the scalar product between the one-hot Y matrix of shape (2,m) and the matrix of probabilities computed from Z^{[2]}, then in my hand-drawn diagrams the matrix of probabilities is the wrong way ’round. 😑

OTOH, if LOSS is simply computed as -\big( (1-Y)\cdot\log(1-AL) + Y\cdot\log(AL) \big) no matter what, it’s still okay.
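
To make the row-order issue concrete, a sketch of the two computations (hypothetical numbers; the scalar-product form is only right if the probability rows are stacked in the same order as the one-hot Y, while the binary formula never touches the extra row):

```python
import numpy as np

Y  = np.array([[1, 0, 1, 1]])                  # labels, shape (1, m)
AL = np.array([[0.9, 0.2, 0.7, 0.6]])          # P(class 1), shape (1, m)

Y_onehot  = np.vstack([1 - Y, Y])              # rows: [class 0; class 1]
P_right   = np.vstack([1 - AL, AL])            # same row order as Y_onehot
P_flipped = np.vstack([AL, 1 - AL])            # the "wrong way 'round"

binary_loss     = -((1 - Y) * np.log(1 - AL) + Y * np.log(AL))     # always fine
onehot_loss_ok  = -np.sum(Y_onehot * np.log(P_right), axis=0)      # matches binary_loss
onehot_loss_bad = -np.sum(Y_onehot * np.log(P_flipped), axis=0)    # does not match

assert np.allclose(onehot_loss_ok, binary_loss)
assert not np.allclose(onehot_loss_bad, binary_loss)
```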

Oh well! I shall add a note to the diagrams.
