Take the function f(x) = cos(x). This is a function of the variable x. However, f(0.5) = cos(0.5) is an evaluation of f(x) at x = 0.5.
I prefer to use my time learning from Andrew’s video presentations in Advanced Learning Algorithms rather than identifying discrepancies in material not advocated by Andrew.
I mean in terms of the cross-entropy loss.
It is easy to see that the following two definitions are mathematically equivalent:
\displaystyle L^{(i)} = -\log\left( \frac{e^{z_{y^{(i)}}}}{\sum_{k=1}^{N} e^{z_k}} \right);
\displaystyle L^{(i)} = \begin{cases} -\log \dfrac{e^{z_1}}{e^{z_1} + e^{z_2} + \dots + e^{z_N}} & \text{if } y^{(i)} = 1 \\[1.5ex] -\log \dfrac{e^{z_2}}{e^{z_1} + e^{z_2} + \dots + e^{z_N}} & \text{if } y^{(i)} = 2 \\[1.5ex] \qquad\qquad \vdots & \\[1.5ex] -\log \dfrac{e^{z_N}}{e^{z_1} + e^{z_2} + \dots + e^{z_N}} & \text{if } y^{(i)} = N \end{cases}.
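To see the equivalence numerically, here is a minimal sketch (the logits and the label below are made up purely for illustration): selecting the y^{(i)}-th entry of the softmax output gives the same value as walking through the conditional branches.

import numpy as np

# Made-up logits z_1..z_N (here N = 3) and a made-up true label y_i (1-based, as in the formulas above)
z = np.array([0.3, 1.7, -0.4])
y_i = 2

# First form: pick the softmax probability of the true class directly
softmax = np.exp(z) / np.sum(np.exp(z))
loss_form_1 = -np.log(softmax[y_i - 1])

# Second form: the case-by-case definition, one branch per possible label
for k in range(1, len(z) + 1):
    if y_i == k:
        loss_form_2 = -np.log(np.exp(z[k - 1]) / np.sum(np.exp(z)))

print(loss_form_1, loss_form_2)  # both print the same value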
It’s sad to hear that, because your attention to subtle details could definitely help improve those materials for the community.
Hello, @ai_is_cool,
Since I think you were looking for where Andrew explained something, the “slides” that, as you found, are not regularly maintained can still serve as a body of text for us to search for video references, because we do not have a public compilation of all of Andrew’s transcripts in one place.
With the help of those slides and a bit of my own effort, I found the following explanations by Andrew:
Please read the highlighted transcript; the screenshot also shows the time mark that roughly points to when he said it.
The key takeaway would be: “by summing up the losses on all of the training examples … you then get the cost function”.
In this 3-step framework, Andrew showed the general form of J, which is in agreement with the transcript quoted above. It is unfortunate that he used binary classification instead of multiclass for the example of L, but if we go to the next screenshot:
I think we can see that Andrew was reusing the same framework, only this time he talked about multiclass. The same framework, including the general form of J, should apply here, too.
The cost being the sum of the losses, as explained in the video from which the first screenshot was extracted, is the definition used consistently across these courses for the squared loss, the logistic loss, and the categorical cross-entropy loss (which makes use of the softmax output).
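In symbols (my own restatement, writing m for the number of training examples), that takeaway reads:

\displaystyle J = \sum_{i=1}^{m} L^{(i)}, \qquad \text{or} \qquad J = \frac{1}{m} \sum_{i=1}^{m} L^{(i)} \ \text{ if the losses are averaged.}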
With the help of the “slides”, it took me less than 15 minutes to get these screenshots. I hoped to find a screen that showed specifically how the cost function for the multiclass problem is defined mathematically, but I might have missed that.
Cheers,
Raymond
All I really want to know is the mathematical expression of the loss function and the mathematical expression of the cost function for the softmax activation function, since Andrew made a mistake when he was talking about the loss function and called it the cost function instead.
It’s a pity that you didn’t explain this notation involving e^{z_{y^{(i)}}} earlier.
You haven’t completed the conditional expression; it just says
if y^{(i)}
without saying what must be true for each of the N expressions.
Do you mean:
if y^{(i)} = 1
if y^{(i)} = 2
etc.?
If so, do you mean that the loss function takes on one of N possible values for each input training example x^{(i)}?
Not exactly.
I would encourage you to contact the contributors who were involved in producing the slides and ask them to review their own contributions, as they are best placed to make corrections and to learn from the experience by adopting a more detail-oriented approach to authoring technical content.
I’ve completed the conditional expression. It seems like your device cuts off the right part of the expression. This is how my reply should look:
The loss function takes N logit values and one true label as inputs, and returns one value.
It’s a pity that you didn’t explain this notation involving e^{z_{y^{(i)}}} earlier.
Apologies for that. I was assuming that it was pretty obvious.
Not exactly.
Well spotted! The common reduction mechanism is the average.
UPD: But the sum can be used as well.
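For instance, in Keras the reduction can be switched from the default averaging to a plain sum. A minimal sketch with made-up numbers (it assumes TensorFlow/Keras is installed):

from tensorflow import keras

# Two made-up examples: true class indices and predicted probabilities
y_true = [1, 2]
y_pred = [[0.05, 0.90, 0.05],
          [0.10, 0.10, 0.80]]

# Default reduction: the average of the per-example losses
scce_mean = keras.losses.SparseCategoricalCrossentropy()
# 'sum' reduction: the plain sum of the per-example losses
scce_sum = keras.losses.SparseCategoricalCrossentropy(reduction="sum")

print(float(scce_mean(y_true, y_pred)))  # ≈ mean of -log(0.90) and -log(0.80)
print(float(scce_sum(y_true, y_pred)))   # ≈ sum of -log(0.90) and -log(0.80)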
Best to be explicit and not make assumptions where mathematics is concerned, I find.
Where is the one true label output variable in your loss function definition? I can only see the N logit values.
Where is the one true label output variable in your loss function definition?
It is the y^{(i)} in the conditional expressions (if y^{(i)} = 1, if y^{(i)} = 2, \dots, if y^{(i)} = N).
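For example, taking N = 3 and a true label y^{(i)} = 2, only the second branch applies, so
\displaystyle L^{(i)} = -\log \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}}.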
So when calling the loss function in Python code, does one pass the extra variable y^{(i)} as well as the z_k, for k = 1, …, N?
Yes, y^{(i)} as well as the z_k should be arguments of the loss function:
from tensorflow import keras

y_true = [2]               # y^{(i)}: the index of the true class
y_pred = [[0.1, 0.8, 0.1]] # softmax probabilities; to pass raw logits z_k instead, set from_logits=True
# Using 'auto'/'sum_over_batch_size' reduction type.
scce = keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred)
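As a sanity check (my own addition, not from the course), the returned value should match taking -\log of the predicted probability of the true class by hand:

import numpy as np

y_true = [2]
y_pred = [[0.1, 0.8, 0.1]]

# -log of the probability assigned to the true class (index 2 here, 0-based)
manual_loss = -np.log(y_pred[0][y_true[0]])
print(manual_loss)  # ≈ 2.3026, which should agree with scce(y_true, y_pred) above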
What does this comment mean?
# Using 'auto'/'sum_over_batch_size' reduction type.
As I mentioned before, machine learning frameworks use the term loss function for the cost function as well. Let’s have a closer look:
tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False,
    ignore_class=None,
    reduction='sum_over_batch_size',
    name='sparse_categorical_crossentropy'
)
The default value for the reduction argument is “sum_over_batch_size”, which means the cost function (the average of the per-example losses) for the batch will be computed. For one training example it will act as the loss function.
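To illustrate this with a sketch of my own (made-up numbers, assuming TensorFlow/Keras is installed), the value returned for a batch should equal the mean of the values returned for the individual examples:

import numpy as np
from tensorflow import keras

scce = keras.losses.SparseCategoricalCrossentropy()  # default 'sum_over_batch_size' reduction

# A made-up batch of two examples (probabilities, since from_logits=False by default)
y_true = [2, 0]
y_pred = [[0.1, 0.8, 0.1],
          [0.7, 0.2, 0.1]]

batch_cost = float(scce(y_true, y_pred))  # the cost: average over the batch
per_example = [float(scce([t], [p])) for t, p in zip(y_true, y_pred)]  # the loss, one example at a time

print(batch_cost, np.mean(per_example))  # the two numbers should match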