Take the function f(x) = cos(x). This is a function of the variable x. However, f(0.5) = cos(0.5) is an evaluation of f(x) at x = 0.5.
I prefer to use my time learning from Andrew’s video presentations in Advanced Learning Algorithms rather than identifying discrepancies in material not advocated by Andrew.
I mean in terms of the cross-entropy loss.
It is easy to see that the following two definitions are mathematically equivalent:
\displaystyle L^{(i)} = -\log\left( \frac{e^{z_{y^{(i)}}}}{\sum_{k=1}^{N} e^{z_k}} \right);
\displaystyle L^{(i)} = \begin{cases} -\log \dfrac{e^{z_1}}{e^{z_1} + e^{z_2} + \dots + e^{z_N}} & \text{if } y^{(i)} = 1 \\[1.5ex] -\log \dfrac{e^{z_2}}{e^{z_1} + e^{z_2} + \dots + e^{z_N}} & \text{if } y^{(i)} = 2 \\[1.5ex] \qquad\qquad \vdots & \\[1.5ex] -\log \dfrac{e^{z_N}}{e^{z_1} + e^{z_2} + \dots + e^{z_N}} & \text{if } y^{(i)} = N \end{cases}.
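To see the equivalence numerically, here is a minimal sketch (the logits and the label below are made up purely for illustration): selecting the y^{(i)}-th entry of the softmax output gives the same value as walking through the conditional branches.

import numpy as np

# Made-up logits z_1..z_N (here N = 3) and a made-up true label y_i (1-based, as in the formulas above)
z = np.array([0.3, 1.7, -0.4])
y_i = 2

# First form: pick the softmax probability of the true class directly
softmax = np.exp(z) / np.sum(np.exp(z))
loss_form_1 = -np.log(softmax[y_i - 1])

# Second form: the case-by-case definition, one branch per possible label
for k in range(1, len(z) + 1):
    if y_i == k:
        loss_form_2 = -np.log(np.exp(z[k - 1]) / np.sum(np.exp(z)))

print(loss_form_1, loss_form_2)  # both print the same value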
It’s sad to hear that, because your attention to subtle details could definitely help improve those materials for the community.
Hello, @ai_is_cool,
Since I think you were looking for where Andrew explained something, the “slides” that, as you found, are not regularly maintained can still serve as a body of text for us to search for video references, because we do not have a public compilation of all of Andrew’s transcripts in one place.
With the help of those slides and a bit of my own effort, I found the following explanations by Andrew:
Please read the highlighted transcript; the screenshot also shows the time mark that roughly points to when he said it.
The key takeaway would be: “by summing up the losses on all of the training examples … you then get the cost function”.
In this 3-step framework, Andrew showed the general form of J, which is in agreement with the transcript quoted above. It is unfortunate that he used binary classification instead of multiclass for the example of L, but if we go to the next screenshot:
I think we can see that Andrew was reusing the same framework, only this time he talked about multiclass. The same framework, including the general form of J, should apply here, too.
The cost being the sum of the losses, as explained in the video from which the first screenshot was extracted, is the definition used consistently across these courses for the squared loss, the logistic loss, and the categorical cross-entropy loss (which makes use of the softmax output).
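In symbols (my own restatement, writing m for the number of training examples), that takeaway reads:

\displaystyle J = \sum_{i=1}^{m} L^{(i)}, \qquad \text{or} \qquad J = \frac{1}{m} \sum_{i=1}^{m} L^{(i)} \ \text{ if the losses are averaged.}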
With the help of the “slides”, it took me less than 15 minutes to get these screenshots. I hoped to find a screen that showed specifically how the cost function for the multiclass problem is defined mathematically, but I might have missed that.
Cheers,
Raymond
All I really want to know is the mathematical expression of the loss function and the mathematical expression of the cost function for the softmax activation function, since Andrew made a mistake when he was talking about the loss function and called it the cost function instead.
It’s a pity that you didn’t explain this notation involving e^{z_{y^{(i)}}} earlier.
You haven’t completed the conditional expression; it just says
if y^{(i)}
without saying what must be true for each of the N expressions.
Do you mean:
if y^{(i)} = 1
if y^{(i)} = 2
etc.?
If so, do you mean that the loss function takes on one of N possible values for each input training example x^{(i)}?
Not exactly.
I would encourage you to contact the contributors who were involved in producing the slides and ask them to review their own contributions, as they are best placed to make corrections and to learn from the experience by adopting a more detail-oriented approach to authoring technical content.
I’ve completed the conditional expression. It seems like your device cuts off the right part of the expression. This is how my reply should look:
The loss function takes N logit values and one true label as inputs, and returns one value.
It’s a pity that you didn’t explain this notation involving e^{z_{y^{(i)}}} earlier.
Apologies for that. I was assuming that it was pretty obvious.
Not exactly.
Well spotted! The common reduction mechanism is the average.
UPD: But the sum can be used as well.
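For instance, in Keras the reduction can be switched from the default averaging to a plain sum. A minimal sketch with made-up numbers (it assumes TensorFlow/Keras is installed):

from tensorflow import keras

# Two made-up examples: true class indices and predicted probabilities
y_true = [1, 2]
y_pred = [[0.05, 0.90, 0.05],
          [0.10, 0.10, 0.80]]

# Default reduction: the average of the per-example losses
scce_mean = keras.losses.SparseCategoricalCrossentropy()
# 'sum' reduction: the plain sum of the per-example losses
scce_sum = keras.losses.SparseCategoricalCrossentropy(reduction="sum")

print(float(scce_mean(y_true, y_pred)))  # ≈ mean of -log(0.90) and -log(0.80)
print(float(scce_sum(y_true, y_pred)))   # ≈ sum of -log(0.90) and -log(0.80)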
Best to be explicit and not make assumptions where mathematics is concerned, I find.
Where is the one true label output variable in your loss function definition? I can only see the N logit values.
Where is the one true label output variable in your loss function definition?
It is the y^{(i)} in the conditional expressions (if y^{(i)} = 1, if y^{(i)} = 2, \dots, if y^{(i)} = N).
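For example, taking N = 3 and a true label y^{(i)} = 2, only the second branch applies, so
\displaystyle L^{(i)} = -\log \frac{e^{z_2}}{e^{z_1} + e^{z_2} + e^{z_3}}.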
So when calling the loss function in Python code, does one pass the extra variable y^{(i)} as well as the z_k, for k = 1, …, N?
Yes, y^{(i)} as well as the z_k should be arguments of the loss function:
from tensorflow import keras

y_true = [2]               # y^{(i)}: the index of the true class
y_pred = [[0.1, 0.8, 0.1]] # softmax probabilities; to pass raw logits z_k instead, set from_logits=True
# Using 'auto'/'sum_over_batch_size' reduction type.
scce = keras.losses.SparseCategoricalCrossentropy()
scce(y_true, y_pred)
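As a sanity check (my own addition, not from the course), the returned value should match taking -\log of the predicted probability of the true class by hand:

import numpy as np

y_true = [2]
y_pred = [[0.1, 0.8, 0.1]]

# -log of the probability assigned to the true class (index 2 here, 0-based)
manual_loss = -np.log(y_pred[0][y_true[0]])
print(manual_loss)  # ≈ 2.3026, which should agree with scce(y_true, y_pred) above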
What does this comment mean?
# Using 'auto'/'sum_over_batch_size' reduction type.
As I mentioned before, machine learning frameworks use the term loss function for the cost function as well. Let’s have a closer look:
tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=False,
    ignore_class=None,
    reduction='sum_over_batch_size',
    name='sparse_categorical_crossentropy'
)
The default value for the reduction argument is “sum_over_batch_size”, which means the cost function (the average of the per-example losses) for the batch will be computed. For one training example it will act as the loss function.
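To illustrate this with a sketch of my own (made-up numbers, assuming TensorFlow/Keras is installed), the value returned for a batch should equal the mean of the values returned for the individual examples:

import numpy as np
from tensorflow import keras

scce = keras.losses.SparseCategoricalCrossentropy()  # default 'sum_over_batch_size' reduction

# A made-up batch of two examples (probabilities, since from_logits=False by default)
y_true = [2, 0]
y_pred = [[0.1, 0.8, 0.1],
          [0.7, 0.2, 0.1]]

batch_cost = float(scce(y_true, y_pred))  # the cost: average over the batch
per_example = [float(scce([t], [p])) for t, p in zip(y_true, y_pred)]  # the loss, one example at a time

print(batch_cost, np.mean(per_example))  # the two numbers should match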