I am puzzling over this section of course 3
The red arrow indicates that the loss involves the whole target value.
The purple arrow indicates that the loss only uses the target as a toggle switch.
But remember that the point is that every y^{(i)} value must be either 1 or 0 by definition, right? They are the “labels” for the data samples and every sample is either a “yes” or a “no”. So the loss is defined for every sample by selecting the relevant one of those two formulas. The point is that your goal for the f_{w,b}(x^{(i)}) value is different depending on whether the y^{(i)} value is 1 or 0.
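For concreteness, here is a minimal sketch of that per-sample selection in NumPy (illustrative code, not taken from the course assignments):

```python
import numpy as np

def loss(f_wb, y):
    """Cross-entropy loss for one sample.
    f_wb: the prediction f_{w,b}(x^{(i)}), a value in (0, 1)
    y:    the label y^{(i)}, which is always 0 or 1
    """
    if y == 1:
        return -np.log(f_wb)      # goal: push f_wb toward 1
    else:
        return -np.log(1 - f_wb)  # goal: push f_wb toward 0

print(loss(0.9, 1))  # small loss: prediction agrees with the label
print(loss(0.9, 0))  # large loss: prediction disagrees with the label
```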
Or maybe I am just missing your point.
Thank you
I understand the target will be either a 1 or a 0
Are you saying that the prediction will also be either a 1 or a 0?
So the loss will vary between 1, -1, and 0?
The prediction is f_{w,b}(x^{(i)}) and it can take any value between 0 and 1. It is between 0 and 1 because we use the Sigmoid function.
Right! Which means the loss will be between 0 and +\infty.
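A quick numerical check of that (assuming the y = 1 branch, so the loss is -\log(\hat{y})):

```python
import numpy as np

# For a label of 1, the loss is -log(f). As the prediction f
# approaches 0 (the worst possible answer), the loss grows without bound:
for f in [0.5, 0.1, 0.01, 1e-6, 1e-12]:
    print(f, -np.log(f))
```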
Here’s a thread from DLS which discusses the cross entropy loss in more detail and shows the graph of the log function between 0 and 1.
Thank you.
In what scenario is the loss = +\infty?
If the y label is 1 and the \hat{y} prediction value is exactly 0, or if the y label is 0 and the \hat{y} prediction value is exactly 1. As you can see from the two formulas, in either of those cases you get -log(0) as the loss, and log(0) = -\infty. Of course, from a mathematical standpoint the output of sigmoid is never exactly 0 or 1; it only approaches them asymptotically. So you could say that the loss is never really infinite in mathematical terms, but we are doing everything in floating point here, so the values can actually “saturate” and round to 0 or 1. If that happens, you can end up with NaN (Not a Number) as the cost value because of the rules for propagating infinite values in IEEE 754 floating point (for example, 0 \cdot -\infty evaluates to NaN).
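You can see the saturation behavior directly in NumPy (a hypothetical demonstration, not course code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

with np.errstate(over="ignore", divide="ignore", invalid="ignore"):
    f = sigmoid(40.0)      # saturates: exactly 1.0 in float64
    print(f)               # 1.0
    print(-np.log(1 - f))  # inf: the loss for a y = 0 label
    # The combined formula -y*log(f) - (1-y)*log(1-f) hits 0 * -inf:
    y = 1.0
    print(-y * np.log(f) - (1 - y) * np.log(1 - f))  # nan
```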
Sorry, I may be using different notation than they use in MLS. In DLS, the output of a model for a particular sample is called \hat{y}. So to put both the MLS and DLS notation together we have:
\hat{y}^{(i)} = f_{w,b}(x^{(i)})
The more mathematically correct way to say this is:
0 < loss < \infty
so the loss is never infinite in mathematical terms, but it can be arbitrarily large. You can always make the prediction worse (farther from the correct answer), although that is never the goal of course. Note that the loss can never be exactly zero either, but that is fine. The way we interpret the predictions is by comparing them to 0.5. If \hat{y} is > 0.5, then we consider that a “yes” answer and “no” otherwise. So the model can achieve 100% accuracy w.r.t. the labels on the samples without the loss actually being 0.
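Here is a small illustration of that point, with made-up predictions and labels:

```python
import numpy as np

f_wb = np.array([0.9, 0.7, 0.2, 0.4])  # hypothetical predictions
y    = np.array([1.0, 1.0, 0.0, 0.0])  # labels

preds = (f_wb > 0.5).astype(int)       # "yes" if above 0.5, else "no"
accuracy = np.mean(preds == y)
cost = np.mean(-y * np.log(f_wb) - (1 - y) * np.log(1 - f_wb))
print(accuracy)  # 1.0: every sample lands on the correct side of 0.5
print(cost)      # ~0.3: yet the average loss is still well above 0
```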
The point of the loss function is that it is the basis for “back propagation” which allows the model to be trained to produce correct answers. The derivatives of the loss tell the algorithm which direction to move the parameter values in order to improve the results and that’s where the learning happens.
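For logistic regression those derivatives work out to a particularly clean form, so one update step looks roughly like this (a sketch with illustrative names, using the standard gradients of the cross-entropy cost):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(X, y, w, b, alpha):
    """One gradient-descent step for logistic regression (illustrative).
    Uses the standard gradients of the cross-entropy cost:
      dJ/dw = (1/m) * X^T (f - y),  dJ/db = (1/m) * sum(f - y)
    """
    m = X.shape[0]
    f = sigmoid(X @ w + b)           # predictions for all m samples
    err = f - y                      # how far each prediction is off
    w = w - alpha * (X.T @ err) / m  # move w opposite the gradient
    b = b - alpha * np.sum(err) / m  # move b opposite the gradient
    return w, b
```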
Thank you.
What are MLS and DLS?
I notice on my calculator Log(0) gives a domain error. I guess that is another way of saying NaN?
MLS is the Machine Learning Specialization, which it looks like you are taking. DLS is the Deep Learning Specialization, which is perhaps the next set of courses to take after MLS. MLS gives you a survey of lots of different types of ML algorithms. DLS focuses specifically on Deep Neural Networks and goes into a lot of detail on how they work, what kinds of problems they can solve, and how to build such solutions.
My guess is that what your calculator means by “domain error” is that it considers the “domain” of the function log to be only positive numbers (strictly greater than zero). So 0 is not in the domain of that function as they define it. “Domain” is the term mathematicians use for the set of all possible inputs to a given function; the “range” of the function is the set of all possible output values from the function.
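You can see both behaviors in Python, which makes the distinction concrete:

```python
import math
import numpy as np

# Python's math module behaves like the calculator: 0 is outside the
# domain of log, so it raises an error instead of returning a value.
try:
    math.log(0)
except ValueError as e:
    print(e)            # "math domain error"

# NumPy follows IEEE 754 instead and returns a special value:
with np.errstate(divide="ignore"):
    print(np.log(0.0))  # -inf (NaN only shows up later, e.g. from 0 * -inf)
```

So a “domain error” is the calculator refusing the input, while -\infty and NaN are what IEEE 754 arithmetic produces when you go ahead and compute anyway.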