What is the Cost Function for Softmax?

Andrew talks a lot about the cost function for Softmax, but up to the video lesson Multiclass Classification in Week 2 of Advanced Learning Algorithms he has only ever defined the loss function mathematically. He has never actually defined what the cost function is, so I have no idea what the cost function is for Softmax. He seems to get mixed up between the loss function and the cost function for Softmax.

Can anyone tell me what it is?

Also, what on earth does “logits” mean?

Hi @ai_is_cool,

Prof. Ng defines the loss function for a single training example (x^{(i)}, y^{(i)}) as
\displaystyle L^{(i)} = -\log \left( \frac{e^{z_{y^{(i)}}}}{\sum_{k=1}^{N} e^{z_k}} \right) (slide 25),
where N is the number of classes; y^{(i)} is the correct label for example i; z_k are the raw outputs (logits) for each class k. This is called the cross-entropy loss with softmax activation.
Then, the cost function is the average of this loss over all m training examples:
\displaystyle J(W, B) = \frac{1}{m} \sum_{i = 1}^{m} L^{(i)} (slide 6).

Logits are the raw outputs of a neural network before applying the softmax function.
In many machine learning frameworks, the terms loss function and cost function are used interchangeably, even though they technically refer to different things.
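If it helps to see the two definitions side by side in code, here is a minimal NumPy sketch (the names `cross_entropy_loss`, `cost`, `Z`, and `Y` are mine, not from the course; class indices are 0-based, as is usual in code):

```python
import numpy as np

def cross_entropy_loss(z, y):
    """Loss for a single example: z is the vector of N logits,
    y is the index of the correct class."""
    # Softmax probability of the correct class, then negative log.
    p = np.exp(z[y]) / np.sum(np.exp(z))
    return -np.log(p)

def cost(Z, Y):
    """Cost: the average of the per-example losses.
    Z has shape (m, N), one row of logits per example;
    Y has shape (m,), the correct class index for each example."""
    m = Z.shape[0]
    return sum(cross_entropy_loss(Z[i], Y[i]) for i in range(m)) / m
```

(In practice you would subtract `max(z)` before exponentiating for numerical stability, but the straightforward form above mirrors the formulas from the lectures.)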


Thank you @conscell,

What are “slides”? I can only see continuous-play video lessons in Week 2 of the Advanced Learning Algorithms course.

It’s confusing when a loss function is used interchangeably with a cost function, as they are different functions and represent different things. It’s a bit like saying cos(x) and sin(x) are used interchangeably.

“Slides” refers to the presentation slide deck that is used in the lectures. These can be found in the MLS course forum area “Resources” topic.

@conscell @TMosh Please can we adhere to the course content rather than alternative material not advocated by Prof. Ng?

It gets confusing over what is part of the course and what is not part of the course.

Thank you

It’s Andrew’s own lecture slides from the course.

I don’t think Andrew references this additional material in his Week 2 video lessons, so it is probably best to study only the course material as presented in this Specialization; otherwise it can get confusing, since the graded tests only expect students to know the course content.

Also, I have no idea where the MLS course forum area “Resources” topic is located.

The lecture slides are not additional material. They’re the same slides he used in the lectures.

I’m confused. He presents continuously running video lessons, not a sequential slide presentation that might have been created from PowerPoint, say.

I think it would be more useful if you were to reference the point in the elapsed time of his video presentation at which he explains what the loss function and cost function are, instead of a slide number.

Perhaps @conscell can provide that.

I’m confused by your nomenclature.

Doesn’t z with a subscript identify only the raw output for one of the ten classes? Why do you attach a subscript denoting an output label for a particular training example?

Shouldn’t the denominator sum contain one term with the same z as the numerator term?

@ai_is_cool
The notation is mathematically equivalent to the one used in lectures.

z_k refers to the logit (raw score) assigned by the model for class k for a particular input example. So if you have 10 classes, you’ll have 10 logits: z_1, z_2, \dots, z_{10}.

y^{(i)} is the true class label for the i-th training example. Suppose y^{(i)}=7. Then z_{y^{(i)}} means the logit for the correct class for the i-th example; in this case it is z_7.

The equation for the loss is saying: “Take the logit for the correct class, exponentiate it, divide by the sum of exponentials of all logits (the softmax denominator), take the log, and negate it.” The denominator includes the numerator’s term e^{z_{y^{(i)}}} because we sum over all classes, including the true class.

Let’s say for a training example (x^{(i)}, y^{(i)}) a neural network outputs logits z=[2.0,1.0,0.1] (in this example we have 3 classes, and here z_1 = 2.0, z_2 = 1.0, z_3 = 0.1). Suppose that the true class is class 2, i.e. y^{(i)}=2. Then:
numerator = e^{1.0},
denominator = e^{2.0} + e^{1.0} + e^{0.1},
loss = L^{(i)} = -\log \left( \frac{e^{1.0}}{e^{2.0} + e^{1.0} + e^{0.1}} \right).
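You can verify those numbers with a few lines of Python (a sketch reusing the 3-class logits above, with 0-based indexing):

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])  # logits for the 3 classes
y = 1                          # true class is class 2 (index 1, 0-based)

# Softmax probability of the correct class, then negative log.
loss = -np.log(np.exp(z[y]) / np.sum(np.exp(z)))
print(loss)  # ~1.417
```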


@ai_is_cool,
I am referring to the slides Prof. Ng used in his lectures. As you mentioned, unfortunately, there are no direct links from the course site to these materials, so I think it would be a good idea to include them in the course.

Yes, loss and cost are often used synonymously in code and documentation. In practice, when people say loss function, they almost always mean the thing being minimized, whether per example or averaged, so it’s context-dependent.
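For example, TensorFlow (which the course labs use) calls its object a “loss”, yet with the default reduction it returns the batch average, i.e. what Prof. Ng calls the cost. A small sketch, reusing the 3-class numbers from the example earlier in this thread:

```python
import tensorflow as tf

# Named a "loss", but by default it averages over the batch,
# which matches the course's definition of the cost.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

z = tf.constant([[2.0, 1.0, 0.1]])  # logits for one example, 3 classes
y = tf.constant([1])                # true class index (0-based)

print(loss_fn(y, z).numpy())  # ~1.417
```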


@ai_is_cool,
Here are the references to the video lectures:
slide 6 - Advanced Learning Algorithms → Week 2 → Training Details @ 9:00
slide 25 - Advanced Learning Algorithms → Week 2 → Softmax @ 10:00


I don’t understand what you mean by “… equivalent…”. It is either the same or not the same. Which is it?

Andrew doesn’t use the notation z_{y^{(i)}} in his video presentations. Please can we adhere to Andrew’s notation in his course video presentations as he doesn’t instruct students to look at other material like “slides”?

Your last expression is an “evaluation” of the loss function and not the loss function as defined by Andrew.

People are erroneously attaching a context, as the loss function and the cost function are different and not the same.

I read this from a link provided in this thread regarding the “slides”…

“Note: These slides are not regularly maintained and you might find missing topics and incorrect information when compared to the lecture videos. We try to fix the errors as soon as we can but we highly recommend adding your own notes to these slides.”

So I cannot trust the content of these slides as being correct, and I would prefer that reference be made only to the content in the video presentations when attempting to explain what the cost function is for the softmax activation function.

“Equivalent” there means mathematically identical in behavior and outcome, even if written with different notation. The notation I used is a common way to define the loss function in a clean, compact way without having to define an entire forest of piecewise conditions.
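To make the equivalence concrete, here is the lecture’s piecewise definition next to the compact one, writing a_k = \frac{e^{z_k}}{\sum_{j=1}^{N} e^{z_j}} for the softmax output of class k:

\displaystyle L = \begin{cases} -\log a_1 & \text{if } y = 1 \\ -\log a_2 & \text{if } y = 2 \\ \ \ \vdots \\ -\log a_N & \text{if } y = N \end{cases}

Picking out whichever branch fires is exactly selecting the correct-class index, so this collapses to \displaystyle L = -\log a_y = -\log \left( \frac{e^{z_y}}{\sum_{k=1}^{N} e^{z_k}} \right), which is the form I wrote with y = y^{(i)}.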

Your last expression is an “evaluation” of the loss function and not the loss function as defined by Andrew.

Could you please be more specific and indicate what the difference is?

People are erroneously attaching a context, as the loss function and the cost function are different and not the same.

Here and here are examples of such documentation. You can submit pull requests in order to fix the documentation.

Did you find any discrepancies between those slides and video lectures?