Andrew talks a lot about the cost function for Softmax, but up to the video lesson "Multiclass Classification" in Week 2 of Advanced Learning Algorithms he has only defined the loss function mathematically. He has never actually defined what the cost function is, so I have no idea what the cost function is for Softmax. He seems to get mixed up between the loss function and the cost function for Softmax.
Prof. Ng defines the loss function for a single training example (x^{(i)}, y^{(i)}) as \displaystyle L^{(i)} = -\log \left( \frac{e^{z_{y^{(i)}}}}{\sum_{k=1}^{N} e^{z_k}} \right) (slide 25),
where N is the number of classes; y^{(i)} is the correct label for example i; z_k are the raw outputs (logits) for each class k. This is called the cross-entropy loss with softmax activation.
Then, the cost function is the average of this loss over all m training examples: \displaystyle J(W, B) = {1 \over m} \sum_{i = 1}^m L^{(i)} (slide 6).
Logits are the raw outputs of a neural network before applying the softmax function.
In many machine learning frameworks, the terms loss function and cost function are used interchangeably, even though they technically refer to different things.
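In case a concrete calculation helps, here is a minimal NumPy sketch of the two definitions above (my own illustration, not code from the course). I am assuming 0-based class indices, so the label y is simply the array index of the true class:

```python
import numpy as np

def softmax_cross_entropy_loss(z, y):
    """Loss L for one example: z is the vector of logits, y is the (0-based) index of the true class."""
    z = z - np.max(z)  # subtract the max logit for numerical stability; it cancels in the ratio
    return -np.log(np.exp(z[y]) / np.sum(np.exp(z)))

def cost(Z, Y):
    """Cost J over m examples: the average of the per-example losses.
    Z has shape (m, N), one row of logits per example; Y holds the m true class indices."""
    return np.mean([softmax_cross_entropy_loss(z, y) for z, y in zip(Z, Y)])
```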
What are "slides"? I can only see continuous-play video lessons in Week 2 of the Advanced Learning Algorithms course.
It's confusing when a loss function is used interchangeably with a cost function, as they are different functions and represent different things. It's a bit like saying cos(x) and sin(x) are used interchangeably.
"Slides" refers to the presentation slide deck that is used in the lectures. These can be found in the MLS course forum area "Resources" topic.
I don't think Andrew references this additional material in his Week 2 video lessons. So it is probably best to study only his course material as presented in this Specialization, otherwise it can be confusing, as the graded tests only expect students to know the course content.
Also, I have no idea where the "MLS course forum area 'Resources' topic" is located.
I'm confused. He presents continuously running video lessons, not a sequential slide presentation that might have been created in PowerPoint, say.
I think it would be more useful if you were to reference the elapsed time in his video presentation at which he explains what the loss function and cost function are, instead of a slide number.
Doesn't z with a subscript identify only the raw output for one of the ten classes? Why do you attach a subscript denoting an output label for a particular training example?
Shouldnât the denominator sum contain one term with the same z as the numerator term?
@ai_is_cool
The notation is mathematically equivalent to the one used in lectures.
z_k refers to the logit (raw score) assigned by the model for class k for a particular input example. So if you have 10 classes, you'll have 10 logits: z_1, z_2, \dots, z_{10}.
y^{(i)} is the true class label for the i-th training example. Suppose y^{(i)}=7. Then z_{y^{(i)}} means the logit for the correct class for the i-th example; in this case it is z_7.
The equation for the loss is saying: "Take the logit for the correct class, exponentiate it, divide by the sum of the exponentials of all the logits (the softmax denominator), take the log, and negate it." The denominator includes the term e^{z_{y^{(i)}}} from the numerator because we sum over all classes, including the true class.
Let's say for a training example (x^{(i)}, y^{(i)}) a neural network outputs logits z=[2.0,1.0,0.1] (in this example we have 3 classes, and here z_1 = 2.0, z_2 = 1.0, z_3 = 0.1). Suppose that the true class is class 2, i.e. y^{(i)}=2. Then: numerator = e^{1.0}, denominator = e^{2.0} + e^{1.0} + e^{0.1}, loss = L^{(i)} = -\log \left( \frac{e^{1.0}}{e^{2.0} + e^{1.0} + e^{0.1}} \right).
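If it helps, here is the same calculation in NumPy (my own sketch; class 2 sits at index 1 because Python arrays are 0-based):

```python
import numpy as np

z = np.array([2.0, 1.0, 0.1])  # logits z_1, z_2, z_3
y = 1                          # true class is class 2, i.e. index 1 (0-based)

loss = -np.log(np.exp(z[y]) / np.sum(np.exp(z)))
print(loss)  # approximately 1.417
```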
@ai_is_cool,
I am referring to the slides Prof. Ng used in his lectures. As you mentioned, unfortunately, there are no direct links from the course site to these materials. So, I think it would be a good idea to include these materials in the course.
Yes, loss and cost are often used synonymously in code and documentation. In practice, when people say loss function, they almost always mean the thing you minimize, whether per example or averaged, so it's context-dependent.
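As an illustration of that naming (this is just how TensorFlow/Keras happens to label things, not anything specific to the lectures): the Keras object is called a "loss", yet with its default reduction it returns the average over the batch, which is what the lectures call the cost J.

```python
import tensorflow as tf

# Called a "loss" by Keras, but the default reduction averages over the batch.
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

logits = tf.constant([[2.0, 1.0, 0.1]])  # one example, three classes
labels = tf.constant([1])                # true class index (0-based)
print(loss_fn(labels, logits).numpy())   # approximately 1.417
```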
I don't understand what you mean by "… equivalent …". It is either the same or not the same. Which is it?
Andrew doesn't use the notation z_{y^{(i)}} in his video presentations. Please can we adhere to Andrew's notation in his course video presentations, as he doesn't instruct students to look at other material like "slides"?
Your last expression is an "evaluation" of the loss function and not the loss function as defined by Andrew.
I read this from a link provided in this thread regarding the "slides"…
"Note: These slides are not regularly maintained and you might find missing topics and incorrect information when compared to the lecture videos. We try to fix the errors as soon as we can but we highly recommend adding your own notes to these slides."
So I cannot trust the content in these slides as being correct, and I would prefer that reference is made only to the content in the video presentations when attempting to explain what the cost function of the softmax activation function is.
"Equivalent" there means mathematically identical in behavior and outcome, even if written with different notation. The notation I used is a common way to define the loss function in a clean, compact way without having to define an entire forest of piecewise conditions.
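To make the equivalence concrete, here is how I recall the lecture writing the loss, case by case over the softmax outputs a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}} (please check this against the video, since I am paraphrasing the notation):

\displaystyle L = \begin{cases} -\log(a_1) & \text{if } y = 1 \\ \quad\vdots & \\ -\log(a_N) & \text{if } y = N \end{cases}

Picking whichever case matches the true label y collapses all of this into the single expression \displaystyle L = -\log(a_y) = -\log\left(\frac{e^{z_y}}{\sum_{k=1}^{N} e^{z_k}}\right), which is the compact form I wrote above.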
Your last expression is an "evaluation" of the loss function and not the loss function as defined by Andrew.
Could you please be more specific and indicate what the difference is?